Building a Small Java Crawler Project with WebMagic
Published: 2019-06-28


1. Environment

Project: Maven project

Database: MySQL

   

 

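2. Project Overview

The page we want to crawl is https://shimo.im/doc/iKYXMBsZ5x0kui8P

Suppose we need to open this page, scrape every Baidu Netdisk link to a movie that it contains, and save the links to a MySQL database.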

    

      

 

3. pom.xml Configuration

First, create a new Maven project and declare the following dependencies in pom.xml.

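<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.jk</groupId>
  <artifactId>shimo</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>shimo</name>
  <url>http://www.example.com</url>
  <properties>
    <!-- The original also set a property to com.jk.ShiMoChromeProcessor here; its tag name was lost. -->
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
  </properties>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>us.codecraft</groupId>
      <artifactId>webmagic-core</artifactId>
      <version>0.7.3</version>
    </dependency>
    <dependency>
      <groupId>us.codecraft</groupId>
      <artifactId>webmagic-extension</artifactId>
      <version>0.7.3</version>
    </dependency>
    <dependency>
      <groupId>org.seleniumhq.selenium</groupId>
      <artifactId>selenium-java</artifactId>
      <version>3.0.1</version>
    </dependency>
    <dependency>
      <groupId>org.seleniumhq.selenium</groupId>
      <artifactId>selenium-chrome-driver</artifactId>
      <version>3.0.1</version>
    </dependency>
    <dependency>
      <groupId>org.seleniumhq.selenium</groupId>
      <artifactId>selenium-remote-driver</artifactId>
      <version>3.0.1</version>
    </dependency>
    <dependency>
      <groupId>com.codeborne</groupId>
      <artifactId>phantomjsdriver</artifactId>
      <version>1.2.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-exec</artifactId>
      <version>1.3</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.6</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.2</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals><goal>shade</goal></goals>
            <configuration>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <!-- Must name the class that contains main(); the Processor class shown later is called ShiMo2ChromeProcessor. -->
                  <mainClass>com.jk.ShiMoChromeProcessor</mainClass>
                </transformer>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>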

 

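4. Download Chrome and ChromeDriver

We use Selenium to crawl the dynamic page, which is currently one of the more common approaches. If you are not familiar with Selenium, it is worth reading up on it first. The line of code below launches a Chrome browser under program control: chromebin is the path to the Chrome executable installed on your machine, chromedriver is the path to the ChromeDriver executable, and userdata is the path to the User Data folder that Chrome creates after installation. Download link: https://pan.baidu.com/s/1NnMdRfEXdwBo-ltpP-J4Sw (extraction code: jqnx)

WebDriver driver = TestChromeDriver.getChromeDriver(chromebin,chromedriver,userdata);

The driver can be placed on any drive, but be sure to note its path.

Running the Chrome installer installs the browser on the C: drive automatically and adds a desktop icon; the chromebin and userdata paths can be found through that icon's Properties dialog.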
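
Since a wrong path is the most likely reason for the browser failing to start, it can help to check the three paths before launching Chrome. The following is a minimal sketch under that assumption; the class PathCheck and its method missing are hypothetical helpers and not part of the original project.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the original project.
class PathCheck {

    // Returns the entries that are null or do not exist on disk.
    static List<String> missing(String... paths) {
        List<String> result = new ArrayList<>();
        for (String p : paths) {
            if (p == null || !Files.exists(Paths.get(p))) {
                result.add(String.valueOf(p));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Pass chromebin, chromedriver and userdata as command-line arguments.
        List<String> notFound = missing(args);
        if (notFound.isEmpty()) {
            System.out.println("All paths exist.");
        } else {
            System.out.println("Missing paths: " + notFound);
        }
    }
}
```

Calling PathCheck.missing(chromebin, chromedriver, userdata) before creating the driver would turn a typo in a path into an explicit message instead of a browser start-up failure.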

5. Put the Parameters in the config.properties File

# These three parameters are used to connect to the database
db_url=jdbc:mysql://localhost:3306/ziyuan?useUnicode=true&characterEncoding=utf-8
db_username=root
db_password=962464
# These are the three paths described above
chromebin=C:\\Users\\hasee\\AppData\\Local\\Google\\Chrome\\Application\\chrome.exe
chromedriver=G:\\new\\chromedriver\\chromedriver.exe
userdata=C:\\Users\\hasee\\AppData\\Local\\Google\\Chrome\\User Data
# Database table name
db_table=shimo
# Link to crawl
guochan=https://shimo.im/doc/iKYXMBsZ5x0kui8P
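
To illustrate how such a file is read, here is a minimal sketch based on java.util.Properties. The class name ConfigDemo, the inline sample text, and the default value "06" are assumptions made for the example; the project itself loads /config.properties from the classpath through its Utils.loadConfig helper. The sketch also shows why the Windows paths are written with doubled backslashes: a doubled backslash in a .properties file is read back as a single one.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical demo class; not part of the original project.
class ConfigDemo {

    // Parses .properties-formatted text into a Properties object.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // In Java source each backslash is escaped once more, hence the four in a row.
        String sample =
              "db_table=shimo\n"
            + "chromedriver=G:\\\\new\\\\chromedriver\\\\chromedriver.exe\n";
        Properties props = parse(sample);
        System.out.println(props.getProperty("db_table"));
        System.out.println(props.getProperty("chromedriver"));
        // A key that is absent from the file returns null unless a default is supplied.
        System.out.println(props.getProperty("runTime", "06"));
    }
}
```

Keys that the Processor reads but the file above does not define, such as runTime, come back as null, so supplying a default as in the last line is one way to avoid a NullPointerException later.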

6. JavaBean for the Database Connection

public class DataSourceModel {

    private String url;
    private String username;
    private String password;

    DataSourceModel() {
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getUsername() {
        return username;
    }

    public void setUsername(String username) {
        this.username = username;
    }

    public String getPassword() {
        return password;
    }

    public void setPassword(String password) {
        this.password = password;
    }
}

7. JavaBean for Saving Crawled Data to the Database

Create a table in the MySQL database to hold the crawled records.
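
The article does not show the table definition. The sketch below builds one possible CREATE TABLE statement from the column names used by the insert and update statements in the Processor class (name, url, createtime, updatetime, path, rengong, type). The column types and lengths, the utf8 charset, and the helper class ShimoTableDdl are all assumptions, not the author's schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper; the column types below are guesses, not the author's schema.
class ShimoTableDdl {

    // Builds a CREATE TABLE statement for the given table name.
    static String createTableSql(String table) {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("name", "VARCHAR(255)");
        columns.put("url", "TEXT");
        columns.put("createtime", "VARCHAR(20)");
        columns.put("updatetime", "VARCHAR(20)");
        columns.put("path", "VARCHAR(512)");
        columns.put("rengong", "VARCHAR(4)");
        columns.put("type", "VARCHAR(255)");

        StringJoiner body = new StringJoiner(", ", "(", ")");
        for (Map.Entry<String, String> c : columns.entrySet()) {
            body.add("`" + c.getKey() + "` " + c.getValue());
        }
        return "CREATE TABLE IF NOT EXISTS `" + table + "` " + body
                + " DEFAULT CHARSET=utf8";
    }

    public static void main(String[] args) {
        // Prints the statement so it can be pasted into a MySQL client.
        System.out.println(createTableSql("shimo"));
    }
}
```

All columns are kept as strings here because the bean stores every field, including the dates, as a String.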

public class Shimo {

    private String name;
    private String url;
    private String createtime;
    private String updatetime;
    private String path;
    private String rengong;
    private String type;

    public String getType() {
        return type;
    }

    public void setType(String type) {
        this.type = type;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getCreatetime() {
        return createtime;
    }

    public void setCreatetime(String createtime) {
        this.createtime = createtime;
    }

    public String getUpdatetime() {
        return updatetime;
    }

    public void setUpdatetime(String updatetime) {
        this.updatetime = updatetime;
    }

    public String getPath() {
        return path;
    }

    public void setPath(String path) {
        this.path = path;
    }

    public String getRengong() {
        return rengong;
    }

    public void setRengong(String rengong) {
        this.rengong = rengong;
    }
}

8. Processor Class
import java.sql.Connection;
import java.sql.SQLException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.processor.PageProcessor;

public class ShiMo2ChromeProcessor implements PageProcessor {

    static Properties properties;
    static DataSourceModel dataSourceModel;
    static String chromebin;
    static String chromedriver;
    static String userdata;
    static String table;
    static String runTime;
    static String quanji;
    static String guochan;
    static String oumei;
    static String yingdan;
    static String dongmanbl;
    static String taiguoyuenanyindu;
    static String hanguo;
    static String riben;

    static {
        // Load all settings once from config.properties on the classpath
        properties = Utils.loadConfig("/config.properties");
        dataSourceModel = new DataSourceModel();
        dataSourceModel.setUrl(properties.getProperty("db_url"));
        dataSourceModel.setUsername(properties.getProperty("db_username"));
        dataSourceModel.setPassword(properties.getProperty("db_password"));
        chromebin = properties.getProperty("chromebin");
        chromedriver = properties.getProperty("chromedriver");
        userdata = properties.getProperty("userdata");
        table = properties.getProperty("db_table");
        runTime = properties.getProperty("runTime");
        quanji = properties.getProperty("quanji");
        guochan = properties.getProperty("guochan");
        oumei = properties.getProperty("oumei");
        yingdan = properties.getProperty("yingdan");
        dongmanbl = properties.getProperty("dongmanbl");
        taiguoyuenanyindu = properties.getProperty("taiguoyuenanyindu");
        hanguo = properties.getProperty("hanguo");
        riben = properties.getProperty("riben");
    }

    private String keyWord;

    private Site site = Site
            .me()
            .setCharset("UTF-8")
            .setCycleRetryTimes(3)
            .setSleepTime(3 * 1000)
            .addHeader("Connection", "keep-alive")
            .addHeader("Cache-Control", "max-age=0")
            .addHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0");

    public ShiMo2ChromeProcessor() {
    }

    @Override
    public Site getSite() {
        return site;
    }

    @Override
    public void process(Page page) {
        WebDriver driver = TestChromeDriver.getChromeDriver(chromebin, chromedriver, userdata);
        try {
            driver.manage().window().maximize(); // maximize the window
            driver.get(page.getRequest().getUrl());
            Thread.sleep(10000); // wait 10 s for the page to render
            // extract the links from the page
            ananyDetail(driver);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            driver.quit(); // always close the browser, even after an error
        }
    }

    public static void ananyDetail(WebDriver driver) throws Exception {
        // category: taken from the page title
        String type = driver.getTitle();
        List<WebElement> list = driver.findElements(By.className("gutter-author-6748903"));
        for (WebElement webElement : list) {
            try {
                List<WebElement> font = webElement.findElements(By.tagName("font"));
                if (font.isEmpty()) {
                    continue;
                }
                String font1 = font.get(0).getText().trim();
                // keep only entries whose text starts with "点" but not with "点击"
                if (font1.startsWith("点") && !font1.startsWith("点击")) {
                    // parse the entry text
                    String text = "";
                    String name = "";
                    String pwd = "";
                    try {
                        text = webElement.getText().replace("☞", "").replace("点", "");
                        // if(text!=null){
                        //     text=text.replace(" ","|");
                        // }
                        if (text.contains("密码")) {
                            // split into name and password ("密码" means "password")
                            String[] nameAndPwd = text.split("密码");
                            name = nameAndPwd[0];
                            pwd = "密码" + nameAndPwd[nameAndPwd.length - 1];
                        } else {
                            // nothing to split
                            name = text;
                            pwd = "";
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    WebElement aTag = null;
                    try {
                        aTag = webElement.findElement(By.tagName("a"));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    // extract the url
                    String url = "";
                    try {
                        if (aTag != null) {
                            url = aTag.getAttribute("href");
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    Shimo shimo = new Shimo();
                    shimo.setPath(driver.getCurrentUrl());
                    shimo.setName(name.trim());
                    String prefix = "";
                    if (url.contains("pan.baidu")) {
                        prefix = "百度网盘:"; // "Baidu Netdisk:"
                    } else {
                        prefix = "链接:"; // "Link:"
                    }
                    shimo.setUrl(prefix + url.trim() + " " + pwd.trim());
                    shimo.setType(type);
                    saveDb(shimo);
                }
            } catch (Exception e) {
                e.printStackTrace();
                continue;
            }
        }
    }

    // Escapes backslashes and single quotes so that a value containing them does not
    // break the SQL string; a PreparedStatement would be the more robust fix.
    private static String esc(String value) {
        return value == null ? "" : value.replace("\\", "\\\\").replace("'", "''");
    }

    public static void saveDb(Shimo shimo) {
        Connection connection = null;
        try {
            // write to the database
            connection = Utils.getConnection(dataSourceModel);
            // first check whether a row with this name already exists
            SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
            String querySql = "select count(1) as totalnum from " + table + " where name='#name'";
            querySql = querySql.replace("#name", esc(shimo.getName()));
            int count = Utils.excuteCountQuery(connection, querySql);
            if (count <= 0) {
                // insert
                String sql = "insert into " + table + " (name,url,createtime,path,rengong,type) values ('#name','#url','#createtime','#path','#rengong','#type')";
                sql = sql.replace("#name", esc(shimo.getName()))
                        .replace("#url", esc(shimo.getUrl()))
                        .replace("#createtime", sdf.format(new Date()))
                        .replace("#path", esc(shimo.getPath()))
                        .replace("#type", esc(shimo.getType()))
                        .replace("#rengong", "0");
                Utils.saveDb(connection, sql);
            } else {
                // update; only rows with rengong='0' are touched
                String updateSql = "update " + table + " set url='#url',updatetime='#updatetime',path='#path',type='#type' where name='#name' and rengong='0'";
                updateSql = updateSql.replace("#name", esc(shimo.getName()))
                        .replace("#url", esc(shimo.getUrl()))
                        .replace("#updatetime", sdf.format(new Date()))
                        .replace("#type", esc(shimo.getType()))
                        .replace("#path", esc(shimo.getPath()));
                Utils.saveDb(connection, updateSql);
            }
        } catch (Exception e) {
            System.out.println("Failed to save to the database");
            e.printStackTrace();
        } finally {
            if (connection != null) {
                try {
                    connection.close();
                } catch (SQLException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("++++++++System starting...");
        Map<String, Boolean> map = new HashMap<>();
        while (true) {
            System.out.println("++++++++System running...");
            SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd");
            String today = simpleDateFormat.format(new Date()); // today
            SimpleDateFormat sdf = new SimpleDateFormat("HH");
            String nowTime = sdf.format(new Date());
            // Intended rule: run only if today has not been processed yet and the hour is 06,
            // then mark today as done. The check is currently disabled, so every pass runs.
            //if((map.get(today)==null||map.get(today)==false)&&runTime.equals(nowTime)){
            if (true) {
                map.put(today, Boolean.TRUE);
                System.out.println("++++++++Fetching data...");
                Spider spider1 = Spider.create(new ShiMo2ChromeProcessor());
                spider1.addUrl(guochan)
                        .setDownloader(new HttpClientDownloader())
                        .thread(1)
                        .run();
            }
            try {
                Thread.sleep(600000); // run once every 10 minutes
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
}

9. TestChromeDriver
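import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class TestChromeDriver {

    public static WebDriver getChromeDriver(String chromebin, String chromedriver, String userdata) {
        /* Location of the Chrome executable; if unset, the default installation directory is used */
        System.setProperty("webdriver.chrome.bin", chromebin);
        /* Location of the Chrome WebDriver; if unset, it is looked up on the PATH */
        System.setProperty("webdriver.chrome.driver", chromedriver);
        ChromeOptions chromeOption = new ChromeOptions();
        // Also pass the executable through ChromeOptions, which is the documented way to select a Chrome binary
        if (chromebin != null && !chromebin.isEmpty()) {
            chromeOption.setBinary(chromebin);
        }
        chromeOption.addArguments("--user-data-dir=" + userdata);
        // chromeOption.addArguments("--headless");
        chromeOption.addArguments("--no-sandbox");
        WebDriver driver = new ChromeDriver(chromeOption);
        return driver;
    }
}

10. Utility Class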
import java.io.IOException;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

public class Utils {

    // Loads a .properties file from the classpath; returns an empty Properties object on failure
    public static Properties loadConfig(String configFile) {
        InputStream input = null;
        Properties properties = new Properties();
        try {
            input = Utils.class.getResourceAsStream(configFile);
            properties.load(input);
        } catch (Exception e) {
            System.out.println("Failed to load the configuration file");
        } finally {
            if (input != null) {
                try {
                    input.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return properties;
    }

    // Opens a JDBC connection; returns null if the driver is missing or the connection fails
    public static Connection getConnection(DataSourceModel dataSourceModel) {
        Connection conn = null;
        try {
            Class.forName("com.mysql.jdbc.Driver");
            conn = DriverManager.getConnection(dataSourceModel.getUrl(), dataSourceModel.getUsername(), dataSourceModel.getPassword());
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return conn;
    }

    // Executes an insert or update statement
    public static void saveDb(Connection connection, String sql) {
        Statement statement = null;
        try {
            if (connection != null) {
                statement = connection.createStatement();
                statement.executeUpdate(sql);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (statement != null) statement.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
    }

    // Runs a count query and returns the value of its "totalnum" column (0 on failure)
    public static int excuteCountQuery(Connection connection, String sql) {
        int rowCount = 0;
        Statement statement = null;
        ResultSet resultSet = null;
        try {
            statement = connection.createStatement();
            resultSet = statement.executeQuery(sql);
            while (resultSet.next()) {
                rowCount = resultSet.getInt("totalnum");
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (resultSet != null) resultSet.close();
                if (statement != null) statement.close();
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
        return rowCount;
    }
}
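
The text handling inside ananyDetail is the part of the crawler that can be tried out without a browser or a database. The sketch below isolates it as a pure function. The class LinkTextParser, the method parse, and the sample title and URL are hypothetical; the logic follows the Processor class (strip "☞" and "点", separate the name from the "密码" password part, pick a prefix depending on whether the URL contains pan.baidu), with one deliberate change: it uses indexOf and lastIndexOf instead of split, so a text that ends with "密码" no longer repeats the name as the password.

```java
// Hypothetical helper mirroring the text handling in ananyDetail.
class LinkTextParser {

    // Returns {name, urlField} for one entry's visible text and its link target.
    static String[] parse(String rawText, String url) {
        String text = rawText.replace("☞", "").replace("点", "");
        String name = text;
        String pwd = "";
        int first = text.indexOf("密码"); // "密码" means "password"
        if (first >= 0) {
            // name: everything before the first marker; password: from the last marker on
            name = text.substring(0, first);
            pwd = text.substring(text.lastIndexOf("密码"));
        }
        // "百度网盘:" = "Baidu Netdisk:", "链接:" = "Link:"
        String prefix = url.contains("pan.baidu") ? "百度网盘:" : "链接:";
        return new String[] { name.trim(), prefix + url.trim() + " " + pwd.trim() };
    }

    public static void main(String[] args) {
        // Made-up sample entry; the URL is a placeholder, not a real share link.
        String[] r = parse("点☞某电影 密码:abcd", "https://pan.baidu.com/s/xxxx");
        System.out.println(r[0]);
        System.out.println(r[1]);
    }
}
```

As in the Processor class, the URL field keeps a trailing space when the entry has no password.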
 
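
The main loop of the Processor class carries a commented-out condition that was meant to limit crawling to one run per day at a configured hour, but it is replaced by if(true), so the crawl runs on every ten-minute pass. The following sketch shows one way that once-a-day check could be expressed. The class DailyRunGuard and its method shouldRun are hypothetical, and the hour is assumed to be a two-digit string such as "06", matching the "HH" format used in main.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for the "once per day at a given hour" rule.
class DailyRunGuard {

    private final Map<String, Boolean> ranOn = new HashMap<>();
    private final String runHour; // two-digit hour, e.g. "06"

    DailyRunGuard(String runHour) {
        this.runHour = runHour;
    }

    // Returns true only for the first call made during runHour on a given day.
    boolean shouldRun(String today, String nowHour) {
        if (!runHour.equals(nowHour) || Boolean.TRUE.equals(ranOn.get(today))) {
            return false;
        }
        ranOn.put(today, Boolean.TRUE);
        return true;
    }

    public static void main(String[] args) {
        DailyRunGuard guard = new DailyRunGuard("06");
        System.out.println(guard.shouldRun("2019-06-28", "05")); // wrong hour
        System.out.println(guard.shouldRun("2019-06-28", "06")); // first pass at 06
        System.out.println(guard.shouldRun("2019-06-28", "06")); // already ran today
    }
}
```

Comparing with runHour.equals(nowHour) on a non-null constructor argument also avoids the NullPointerException that runTime.equals(nowTime) would throw when runTime is missing from config.properties.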

Reposted from: https://www.cnblogs.com/fangyunchen/p/10260518.html
