当前位置：首页 > news >正文

网站里自动切换图片怎么做网站后台管理器怎么做

news 2026/4/30 7:36:38

网站里自动切换图片怎么做,网站后台管理器怎么做,黄金网站app在线观看下载10,厦门公司网站制作流程一.Scrapy框架简介何为框架#xff0c;就相当于一个封装了很多功能的结构体#xff0c;它帮我们把主要的结构给搭建好了#xff0c;我们只需往骨架里添加内容就行。scrapy框架是一个为了爬取网站数据#xff0c;提取数据的框架#xff0c;我们熟知爬虫总共有四大部分就相当于一个封装了很多功能的结构体它帮我们把主要的结构给搭建好了我们只需往骨架里添加内容就行。scrapy框架是一个为了爬取网站数据提取数据的框架我们熟知爬虫总共有四大部分请求、响应、解析、存储scrapy框架都已经搭建好了。scrapy是基于twisted框架开发而来twisted是一个流行的事件驱动的python网络框架scrapy使用了一种非阻塞又名异步的代码实现并发的Scrapy之所以能实现异步得益于twisted框架。twisted有事件队列哪一个事件有活动就会执行Scrapy它集成高性能异步下载队列分布式解析持久化等。 1.五大核心组件引擎(Scrapy) 框架核心用来处理整个系统的数据流的流动, 触发事务(判断是何种数据流然后再调用相应的方法)。也就是负责Spider、ItemPipeline、Downloader、Scheduler中间的通讯信号、数据传递等所以被称为框架的核心。调度器(Scheduler) 用来接受引擎发过来的请求并按照一定的方式进行整理排列放到队列中当引擎需要时交还给引擎。可以想像成一个URL抓取网页的网址或者说是链接的优先队列, 由它来决定下一个要抓取的网址是什么, 同时去除重复的网址。下载器(Downloader) 负责下载引擎发送的所有Requests请求并将其获取到的Responses交还给Scrapy Engine(引擎)由引擎交给Spider来处理。Scrapy下载器是建立在twisted这个高效的异步模型上的。爬虫(Spiders) 用户根据自己的需求编写程序用于从特定的网页中提取自己需要的信息即所谓的实体(Item)。用户也可以从中提取出链接让Scrapy继续抓取下一个页面。跟进的URL提交给引擎再次进入Scheduler(调度器)。项目管道(Pipeline) 负责处理爬虫提取出来的item主要的功能是持久化实体、验证实体的有效性、清除不需要的信息。 2.工作流程 Scrapy中的数据流由引擎控制其过程如下: 1用户编写爬虫主程序将需要下载的页面请求requests递交给引擎引擎将请求转发给调度器 2调度实现了优先级、去重等策略调度从队列中取出一个请求交给引擎转发给下载器引擎和下载器中间有中间件作用是对请求加工如对requests添加代理、ua、cookieresponse进行过滤等 3下载器下载页面将生成的响应通过下载器中间件发送到引擎 4 爬虫主程序进行解析这个时候解析函数将产生两类数据一种是items、一种是链接URL其中requests按上面步骤交给调度器items交给数据管道数据管道实现数据的最终处理官方文档英文版https://docs.scrapy.org/en/latest/ http://doc.scrapy.org/en/master/ 中文版https://scrapy-chs.readthedocs.io/zh_CN/latest/intro/overview.html https://www.osgeo.cn/scrapy/topics/architecture.html 二、安装及常用命令介绍 1. 安装 Linuxpip3 install scrapy Windows a. pip3 install wheelb. 下载twisted http://www.lfd.uci.edu/~gohlke/pythonlibs/#twistedc. shift右击进入下载目录执行 pip3 install typed\_ast-1.4.0-cp36-cp36m-win32.whld. pip3 install pywin32e. pip3 install scrapy2.scrapy基本命令行 1创建一个新的项目 scrapy startproject ProjectName2生成爬虫 scrapy genspider SpiderNamewebsite3运行(crawl) # -o output scrapy crawl SpiderName scrapy crawl SpiderName \-o file.json scrapy crawl SpiderName\-o file.csv4检查spider文件是否有语法错误 scrapy check5list返回项目所有spider名称 scrapy list6测试电脑当前爬取速度性能 scrapy bench7scrapy runspider scrapy runspider zufang\_spider.py8编辑spider文件 scrapy edit spider 相当于打开vim模式实际并不好用在IDE中编辑更为合适。9将网页内容下载下来然后在终端打印当前返回的内容相当于 request 和 urllib 方法 scrapy fetch url10将网页内容保存下来并在浏览器中打开当前网页内容直观呈现要爬取网页的内容:　 scrapy view url11进入终端。打开 scrapy 显示台类似ipython可以用来做测试 scrapy shell \[url\]12输出格式化内容 scrapy parse url \[options\]13返回系统设置信息 scrapy settings \[options\] 如 $ scrapy settings \--get BOT\_NAME scrapybot14显示scrapy版本 scrapy version \[\-v\] 后面加 \-v 可以显示scrapy依赖库的版本三、简单实例以麦田租房信息爬取为例网站http://bj.maitian.cn/zfall/PG1 **1.**创建项目 scrapy startproject houseinfo生成项目结构 scrapy.cfg 项目的主配置信息。真正爬虫相关的配置信息在settings.py文件中 items.py 设置数据存储模板用于结构化数据如Django的Model pipelines 数据持久化处理 settings.py 配置文件 spiders 爬虫目录 2.创建爬虫应用程序 cd houseinfo scrapy genspider maitian maitian.com然后就可以在spiders目录下看到我们的爬虫主程序 3.编写爬虫文件步骤2执行完毕后会在项目的spiders中生成一个应用名的py爬虫文件文件源码如下 1 # -\*- coding: utf-8 -\*-2 import scrapy3 4 5 class MaitianSpider(scrapy.Spider): 6 name maitian **# 应用名称** 7 allowed\_domains \[maitian.com\] **#一般注释掉允许爬取的域名如果遇到非该域名的url则爬取不到数据**8 start\_urls \[http://maitian.com/\] **#起始爬取的url列表该列表中存在的url都会被parse进行请求的发送**9 10 #解析函数 11 def parse(self, response): 12 pass我们可以在此基础上根据需求进行编写 1 # -\*- coding: utf-8 -\*-2 import scrapy3 4 class MaitianSpider(scrapy.Spider): 5 name maitian6 start\_urls \[http://bj.maitian.cn/zfall/PG100\]7 8 9 #解析函数 10 def parse(self, response): 11 12 li\_list response.xpath(//div\[classlist\_wrap\]/ul/li) 13 results \[\] 14 for li in li\_list: 15 title li.xpath(./div\[2\]/h1/a/text()).extract\_first().strip() 16 price li.xpath(./div\[2\]/div/ol/strong/span/text()).extract\_first().strip() 17 square li.xpath(./div\[2\]/p\[1\]/span\[1\]/text()).extract\_first().replace(㎡,)　　　　　# 将面积的单位去掉 18 area li.xpath(./div\[2\]/p\[2\]/span/text()\[2\]).extract\_first().strip().**split(\\xa0)\[0\]　　# 以空格分隔** 19 adress li.xpath(./div\[2\]/p\[2\]/span/text()\[2\]).extract\_first().strip().split(\\xa0)\[2\] 20 21 dict { 22 标题:title, 23 月租金:price, 24 面积:square, 25 区域:area, 26 地址:adress 27 } 28 results.append(dict) 29 30 print(title,price,square,area,adress) 31 return results须知 xpath为scrapy中的解析方式xpath函数返回的为列表列表中存放的数据为Selector类型数据。解析到的内容被封装在Selector对象中需要调用extract()函数将解析的内容从Selector中取出。如果可以保证xpath返回的列表中只有一个列表元素则可以使用extract_first(), 否则必须使用extract() 两者等同都是将列表中的内容提取出来 title li.xpath(‘./div[2]/h1/a/text()’).extract_first().strip() title li.xpath(‘./div[2]/h1/a/text()’)[0].extract()****.strip() 4. 设置修改settings.py配置文件相关配置: 1 #伪装请求载体身份 2 USER\_AGENT Mozilla/5.0 (Macintosh; Intel Mac OS X 10\_12\_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 3 4 #可以忽略或者不遵守robots协议 5 ROBOTSTXT\_OBEY False 5.执行爬虫程序scrapy crawl maitain **爬取全站数据也就是全部页码数据。**本例中总共100页观察页面之间的共性构造通用url 方式一通过占位符构造通用url 1 import scrapy 2 3 class MaitianSpider(scrapy.Spider): 4 name maitian5 start\_urls \[http://bj.maitian.cn/zfall/PG{}.**format(page) for page in range(1,4**)\] **#注意写法**6 7 8 #解析函数9 def parse(self, response): 10 11 li\_list response.xpath(//div\[classlist\_wrap\]/ul/li) 12 results \[\] 13 for li in li\_list: 14 title li.xpath(./div\[2\]/h1/a/text()).extract\_first().strip() 15 price li.xpath(./div\[2\]/div/ol/strong/span/text()).extract\_first().strip() 16 square li.xpath(./div\[2\]/p\[1\]/span\[1\]/text()).extract\_first().replace(㎡,) 17 ** # 也可以通过正则匹配提取出来** 18 area li.xpath(./div\[2\]/p\[2\]/span/text()\[2\])..re(r昌平|朝阳|东城|大兴|丰台|海淀|石景山|顺义|通州|西城)\[0\] 19 adress li.xpath(./div\[2\]/p\[2\]/span/text()\[2\]).extract\_first().strip().split(\\xa0)\[2\] 20 21 dict { 22 标题:title, 23 月租金:price, 24 面积:square, 25 区域:area, 26 地址:adress 27 } 28 results.append(dict) 29 30 return results如果碰到一个表达式不能包含所有情况的项目解决方式是先分别写表达式最后通过列表相加将所有url合并成一个url列表例如 start\_urls \[http://www.guokr.com/ask/hottest/?page{}.format(n) for n in range(1, 8)\] **** \[http://www.guokr.com/ask/highlight/?page{}.format(m) for m in range(1, 101)\]方式二通过重写start_requests方法获取所有的起始url。不用写start_urls 1 import scrapy 2 3 class MaitianSpider(scrapy.Spider): 4 name maitian5 **6 def start\_requests(self): 7 pages\[\]8 for page in range(90,100):9 urlhttp://bj.maitian.cn/zfall/PG{}.format(page) 10 pagescrapy.Request(url) 11 pages.append(page) 12 return pages** 13 14 #解析函数 15 def parse(self, response): 16 17 li\_list response.xpath(//div\[classlist\_wrap\]/ul/li) 18 19 results \[\] 20 for li in li\_list: 21 title li.xpath(./div\[2\]/h1/a/text()).extract\_first().strip(), 22 price li.xpath(./div\[2\]/div/ol/strong/span/text()).extract\_first().strip(), 23 square li.xpath(./div\[2\]/p\[1\]/span\[1\]/text()).extract\_first().replace(㎡,), 24 area li.xpath(./div\[2\]/p\[2\]/span/text()\[2\]).re(r昌平|朝阳|东城|大兴|丰台|海淀|石景山|顺义|通州|西城)\[0\], 25 adress li.xpath(./div\[2\]/p\[2\]/span/text()\[2\]).extract\_first().strip().split(\\xa0)\[2\] 26 27 dict { 28 标题:title, 29 月租金:price, 30 面积:square, 31 区域:area, 32 地址:adress 33 } 34 results.append(dict) 35 36 return results四、数据****持久化存储基于终端指令的持久化存储基于管道的持久化存储只要是数据持久化存储parse方法必须有返回值也就是return后的内容 1. 基于终端指令的持久化存储执行输出指定格式进行存储将爬取到的数据写入不同格式的文件中进行存储windows终端不能使用txt格式 scrapy crawl 爬虫名称 -o xxx.jsonscrapy crawl 爬虫名称 -o xxx.xmlscrapy crawl 爬虫名称 -o xxx.csv 以麦田为例spider中的代码不变将返回值写到qiubai.csv中。本地没有就会自己创建一个。本地有就会追加 scrapy crawl maitian -o maitian.csv就会在项目目录下看到生成的文件查看文件内容 **2.**基于管道的持久化存储 scrapy框架中已经为我们专门集成好了高效、便捷的持久化操作功能我们直接使用即可。要想使用scrapy的持久化操作功能我们首先来认识如下两个文件 items.py数据结构模板文件。定义数据属性。pipelines.py管道文件。接收数据items进行持久化操作。持久化流程 ①　爬虫文件爬取到数据解析后需要将数据封装到items对象中。 ②　**使用yield关键字将items对象提交给pipelines管道**进行持久化操作。 ③　在管道文件中的process_item方法中接收爬虫文件提交过来的item对象然后编写持久化存储的代码**将item对象中存储的数据进行持久化存储在管道的process_item方法中执行io操作,进行持久化存储** ④　settings.py配置文件中开启管道 2.1****保存到本地的持久化存储爬虫文件**:maitian.py** 1 import scrapy2 from houseinfo.items import HouseinfoItem **# 将item导入** 3 4 class MaitianSpider(scrapy.Spider): 5 name maitian6 start\_urls \[http://bj.maitian.cn/zfall/PG100\]7 8 #解析函数9 def parse(self, response): 10 11 li\_list response.xpath(//div\[classlist\_wrap\]/ul/li) 12 13 for li in li\_list: 14 **item HouseinfoItem(** 15 title li.xpath(./div\[2\]/h1/a/text()).extract\_first().strip(), 16 price li.xpath(./div\[2\]/div/ol/strong/span/text()).extract\_first().strip(), 17 square li.xpath(./div\[2\]/p\[1\]/span\[1\]/text()).extract\_first().replace(㎡,), 18 area li.xpath(./div\[2\]/p\[2\]/span/text()\[2\]).extract\_first().strip().split(\\xa0)\[0\], 19 adress li.xpath(./div\[2\]/p\[2\]/span/text()\[2\]).extract\_first().strip().split(\\xa0)\[2\] 20 ) 21 22 yield item ** # 提交给管道然后管道定义存储方式**items文件items.py 1 import scrapy 2 3 class HouseinfoItem(scrapy.Item): 4 title scrapy.Field() #存储标题里面可以存储任意类型的数据 5 price scrapy.Field() 6 square scrapy.Field() 7 area scrapy.Field() 8 adress scrapy.Field()管道文件pipelines.py 1 class HouseinfoPipeline(object):2 def \_\_init\_\_(self):3 self.file None 4 5 #开始爬虫时执行一次6 def open\_spider(self,**spider**):7 self.file open(maitian.csv,**a**,encodingutf-8) # 选用了追加模式8 self.file.write(,.join(\[标题,月租金,面积,区域,地址,\\n\]))9 print(开始爬虫) 10 11 # 因为该方法会被执行调用多次所以文件的开启和关闭操作写在了另外两个只会各自执行一次的方法中。 12 def process\_item(self, item, spider): 13 content \[item\[title\], item\[price\], item\[square\], item\[area\], item\[adress\], \\n\] 14 self.file.write(,.join(content)) 15 return item 16 17 # 结束爬虫时执行一次 18 def close\_spider(self,**spider**): 19 self.file.close() 20 print(结束爬虫)配置文件settings.py 1 #伪装请求载体身份2 USER\_AGENT Mozilla/5.0 (Macintosh; Intel Mac OS X 10\_12\_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 3 4 #可以忽略或者不遵守robots协议5 ROBOTSTXT\_OBEY False 6 7 **#开启管道 ** 8 ITEM\_PIPELINES { 9 houseinfo.pipelines.HouseinfoPipeline: 300, #数值300表示为优先级值越小优先级越高 10 }五、爬取多级页面爬取多级页面会遇到2个问题问题1如何对下一层级页面发送请求答在每一个解析函数的末尾通过Request方法对下一层级的页面手动发起请求 **# 先提取二级页面url再对二级页面发送请求。多级页面以此类推** def parse(self, response):next\_url \ response.xpath(//div\[2\]/h2/a/href).extract()\[0\] # 提取二级页面urlyield scrapy.Request(urlnext\_url, callbackself.next\_parse) **# 对二级页面发送请求注意要用yield回调函数不带括号**问题2解析的数据不在同一张页面中最终如何将数据传递答涉及到请求传参可以在对下一层级页面发送请求的时候通过meta参数进行数据传递meta字典就会传递给回调函数的response参数。下一级的解析函数通过response获取item先通过 response.meta返回接收到的meta字典再获得item字典 # 通过meta参数进行Request的数据传递meta字典就会传递给回调函数的response参数 def parse(self, response):item \ Item() # 实例化item对象Item\[field1\] response.xpath(expression1).extract()\[0\] # 列表中只有一个元素Item\[field2\] response.xpath(expression2).extract() # 列表next\_url response.xpath(expression3).extract()\[0\] # 提取二级页面url# meta参数:请求传参.通过meta参数进行Request的数据传递meta字典就会传递给回调函数的response参数yield scrapy.Request(urlnext\_url, callbackself.next\_parse,**meta{item:item}**) # 对二级页面发送请求def next\_parse(self,response):# 通过response获取item. 先通过 response.meta返回接收到的meta字典,再获得item字典item **response.meta\[item****\]**item\[field\] response.xpath(expression).extract\_first()yield item #提交给管道案例1麦田对所有页码发送请求。不推荐将每一个页码对应的url存放到爬虫文件的起始url列表start_urls中。这里我们使用Request方法手动发起请求。 # -\*- coding: utf-8 -\*- import scrapy from houseinfo.items import HouseinfoItem # 将item导入class MaitianSpider(scrapy.Spider):name \ maitianstart\_urls \ \[http://bj.maitian.cn/zfall/PG1\]#爬取多页page 1url \ http://bj.maitian.cn/zfall/PG%d#解析函数def parse(self, response):li\_list \ response.xpath(//div\[classlist\_wrap\]/ul/li)for li in li\_list:item \ HouseinfoItem(title \ li.xpath(./div\[2\]/h1/a/text()).extract\_first().strip(),price \ li.xpath(./div\[2\]/div/ol/strong/span/text()).extract\_first().strip(),square \ li.xpath(./div\[2\]/p\[1\]/span\[1\]/text()).extract\_first().replace(㎡,),area \ li.xpath(./div\[2\]/p\[2\]/span/text()\[2\]).re(r昌平|朝阳|东城|大兴|丰台|海淀|石景山|顺义|通州|西城)\[0\], # 也可以通过正则匹配提取出来adress li.xpath(./div\[2\]/p\[2\]/span/text()\[2\]).extract\_first().strip().split(\\xa0)\[2\])\[http://bj.maitian.cn/zfall/PG{}.format(page) for page in range(1, 4)\]yield item # 提交给管道然后管道定义存储方式if self.page 4:self.page 1new\_url \ format(self.url%self.page) # 这里的%是拼接的意思yield scrapy.Request(urlnew\_url,callbackself.parse) # 手动发起一个请求注意一定要写yield案例2这个案例比较好的一点是parse函数既有对下一页的回调又有对详情页的回调 import scrapyclass QuotesSpider(scrapy.Spider):name \ quotes\_2\_3start\_urls \ \[http://quotes.toscrape.com,\]allowed\_domains \ \[toscrape.com,\]def parse(self,response):for quote in response.css(div.quote):yield{quote: quote.css(span.text::text).extract\_first(),author: quote.css(small.author::text).extract\_first(),tags: quote.css(div.tags a.tag::text).extract(),}author\_page \ response.css(small.authora::attr(href)).extract\_first()authro\_full\_url \ response.urljoin(author\_page)yield scrapy.Request(**authro\_full\_url, callbackself.parse\_author**) ** # 对详情页发送请求回调详情页的解析函数**next\_page \ response.css(**li.next a::attr(href)**).extract\_first()　　　　　　**# 通过css选择器定位到下一页**if next\_page is not None:next\_full\_url \ response.urljoin(next\_page)yield scrapy.Request(**next\_full\_url, callbackself.pars**e) 　　 **# 对下一页发送请求回调自己的解析函数**def parse\_author(self,response):yield{author: response.css(.author-title::text).extract\_first(),author\_born\_date: response.css(.author-born-date::text).extract\_first(),author\_born\_location: response.css(.author-born-location::text).extract\_first(),authro\_description: response.css(.author-born-location::text).extract\_first(),**案例3**爬取www.id97.com电影网将一级页面中的电影名称类型评分二级页面中的上映时间导演片长进行爬取。多级页面传参 # -\*- coding: utf-8 -\*- import scrapy from moviePro.items import MovieproItem class MovieSpider(scrapy.Spider):name \ movieallowed\_domains \ \[www.id97.com\]start\_urls \ \[http://www.id97.com/\]def parse(self, response):div\_list \ response.xpath(//div\[classcol-xs-1-5 movie-item\])for div in div\_list:item \ MovieproItem() item\[name\] div.xpath(.//h1/a/text()).extract\_first()item\[score\] div.xpath(.//h1/em/text()).extract\_first()item\[kind\] div.xpath(.//div\[classotherinfo\]).xpath(string(.)).extract\_first()item\[detail\_url\] div.xpath(./div/a/href).extract\_first()#meta参数:请求传参.通过meta参数进行Request的数据传递meta字典就会传递给回调函数的response参数yield scrapy.Request(urlitem\[detail\_url\],callbackself.parse\_detail,meta{item:item}) def parse\_detail(self,response):#通过response获取item. 先通过 response.meta返回接收到的meta字典,再获得item字典item response.meta\[item\]item\[actor\] response.xpath(//div\[classrow\]//table/tr\[1\]/a/text()).extract\_first()item\[time\] response.xpath(//div\[classrow\]//table/tr\[7\]/td\[2\]/text()).extract\_first()item\[long\] response.xpath(//div\[classrow\]//table/tr\[8\]/td\[2\]/text()).extract\_first()yield item #提交item到管道**案例4稍复杂可参考链接进行理解**https://github.com/makcyun/web_scraping_with_python/tree/master/https://www.cnblogs.com/sanduzxcvbnm/p/10277414.html 1 #!/user/bin/env python2 3 4 爬取豌豆荚网站所有分类下的全部 app5 数据爬取包括两个部分6 一数据指标7 1 爬取首页8 2 爬取第2页开始的 ajax 页9 二图标10 使用class方法下载首页和 ajax 页11 分页循环两种爬取思路12 指定页数进行for 循环和不指定页数一直往下爬直到爬不到内容为止13 1 for 循环14 15 16 import scrapy 17 from wandoujia.items import WandoujiaItem 18 19 import requests 20 from pyquery import PyQuery as pq 21 import re 22 import csv 23 import pandas as pd 24 import numpy as np 25 import time 26 import pymongo 27 import json 28 import os 29 from urllib.parse import urlencode 30 import random 31 import logging 32 33 logging.basicConfig(filenamewandoujia.log,filemodew,levellogging.DEBUG,format%(asctime)s %(message)s,datefmt%Y/%m/%d %I:%M:%S %p)34 # https://juejin.im/post/5aee70105188256712786b7f35 logging.warning(warn message)36 logging.error(error message)37 38 39 class WandouSpider(scrapy.Spider): 40 name wandou41 allowed\_domains \[www.wandoujia.com\]42 start\_urls \[http://www.wandoujia.com/\]43 44 def \_\_init\_\_(self):45 self.cate\_url https://www.wandoujia.com/category/app46 # 首页url47 self.url https://www.wandoujia.com/category/48 # ajax 请求url49 self.ajax\_url https://www.wandoujia.com/wdjweb/api/category/more?50 # 实例化分类标签51 self.wandou\_category Get\_category() 52 53 def start\_requests(self): 54 yield scrapy.Request(self.cate\_url,callbackself.get\_category)55 56 def get\_category(self,response): 57 # # num 058 cate\_content self.wandou\_category.parse\_category(response) 59 for item in cate\_content: 60 child\_cate item\[child\_cate\_codes\]61 for cate in child\_cate: 62 cate\_code item\[cate\_code\]63 cate\_name item\[cate\_name\]64 child\_cate\_code cate\[child\_cate\_code\]65 child\_cate\_name cate\[child\_cate\_name\]66 67 68 # # 单类别下载69 # cate\_code 502970 # child\_cate\_code 83771 # cate\_name 通讯社交72 # child\_cate\_name 收音机73 74 # while循环75 page 1 # 设置爬取起始页数76 print(\* \* 50)77 78 # # for 循环下一页79 # pages \[\]80 # for page in range(1,3):81 # print(正在爬取%s-%s 第 %s 页 %82 # (cate\_name, child\_cate\_name, page))83 logging.debug(正在爬取%s-%s 第 %s 页 %84 (cate\_name, child\_cate\_name, page))85 86 if page 1:87 # 构造首页url88 category\_url {}{}\_{} .format(self.url, cate\_code, child\_cate\_code) 89 else:90 params { 91 catId: cate\_code, # 大类别92 subCatId: child\_cate\_code, # 小类别93 page: page,94 }95 category\_url self.ajax\_url urlencode(params) 96 97 dict {page:page,cate\_name:cate\_name,cate\_code:cate\_code,child\_cate\_name:child\_cate\_name,child\_cate\_code:child\_cate\_code}98 99 yield scrapy.Request(category\_url,callbackself.parse,metadict) 100 101 # # for 循环方法 102 # pa yield scrapy.Request(category\_url,callbackself.parse,metadict) 103 # pages.append(pa) 104 # return pages 105 106 def parse(self, response): 107 if len(response.body) 100: # 判断该页是否爬完数值定为100是因为无内容时长度是87 108 page response.meta\[page\] 109 cate\_name response.meta\[cate\_name\] 110 cate\_code response.meta\[cate\_code\] 111 child\_cate\_name response.meta\[child\_cate\_name\] 112 child\_cate\_code response.meta\[child\_cate\_code\] 113 114 if page 1: 115 contents response 116 else: 117 jsonresponse json.loads(response.body\_as\_unicode()) 118 contents jsonresponse\[data\]\[content\] 119 # response 是json,json内容是htmlhtml 为文本不能直接使用.css 提取要先转换 120 contents scrapy.Selector(textcontents, typehtml) 121 122 contents contents.css(.card) 123 for content in contents: 124 # num 1 125 item WandoujiaItem() 126 item\[cate\_name\] cate\_name 127 item\[child\_cate\_name\] child\_cate\_name 128 item\[app\_name\] self.clean\_name(content.css(.name::text).extract\_first()) 129 item\[install\] content.css(.install-count::text).extract\_first() 130 item\[volume\] content.css(.meta span:last-child::text).extract\_first() 131 item\[comment\] content.css(.comment::text).extract\_first().strip() 132 item\[icon\_url\] self.get\_icon\_url(content.css(.icon-wrap a img),page) 133 yield item 134 135 # 递归爬下一页 136 page 1 137 params { 138 catId: cate\_code, # 大类别 139 subCatId: child\_cate\_code, # 小类别 140 page: page, 141 } 142 ajax\_url self.ajax\_url urlencode(params) 143 144 dict {page:page,cate\_name:cate\_name,cate\_code:cate\_code,child\_cate\_name:child\_cate\_name,child\_cate\_code:child\_cate\_code} 145 yield scrapy.Request(ajax\_url,callbackself.parse,metadict) 146 147 148 149 # 名称清除方法1 去除不能用于文件命名的特殊字符 150 def clean\_name(self, name): 151 rule re.compile(r\[\\/\\\\\\:\\\*\\?\\\\\\\\|\]) # / \\ : \* ? |) 152 name re.sub(rule, , name) 153 return name 154 155 def get\_icon\_url(self,item,page): 156 if page 1: 157 if item.css(::attr(src)).extract\_first().startswith(https): 158 url item.css(::attr(src)).extract\_first() 159 else: 160 url item.css(::attr(data-original)).extract\_first() 161 # ajax页url提取 162 else: 163 url item.css(::attr(data-original)).extract\_first() 164 165 # if url: # 不要在这里添加url存在判断否则空url 被过滤掉导致编号对不上 166 return url 167 168 169 # 首先获取主分类和子分类的数值代码 # # # # # # # # # # # # # # # # 170 class Get\_category(): 171 def parse\_category(self, response): 172 category response.css(.parent-cate) 173 data \[{ 174 cate\_name: item.css(.cate-link::text).extract\_first(), 175 cate\_code: self.get\_category\_code(item), 176 child\_cate\_codes: self.get\_child\_category(item), 177 } for item in category\] 178 return data 179 180 # 获取所有主分类标签数值代码 181 def get\_category\_code(self, item): 182 cate\_url item.css(.cate-link::attr(href)).extract\_first() 183 184 pattern re.compile(r.\*/(\\d)) # 提取主类标签代码 185 cate\_code re.search(pattern, cate\_url) 186 return cate\_code.group(1) 187 188 # 获取所有子分类标签数值代码 189 def get\_child\_category(self, item): 190 child\_cate item.css(.child-cate a) 191 child\_cate\_url \[{ 192 child\_cate\_name: child.css(::text).extract\_first(), 193 child\_cate\_code: self.get\_child\_category\_code(child) 194 } for child in child\_cate\] 195 196 return child\_cate\_url 197 198 # 正则提取子分类 199 def get\_child\_category\_code(self, child): 200 child\_cate\_url child.css(::attr(href)).extract\_first() 201 pattern re.compile(r.\*\_(\\d)) # 提取小类标签编号 202 child\_cate\_code re.search(pattern, child\_cate\_url) 203 return child\_cate\_code.group(1) 204 205 # # 可以选择保存到txt 文件 206 # def write\_category(self,category): 207 # with open(category.txt,a,encodingutf\_8\_sig,newline) as f: 208 # w csv.writer(f) 209 # w.writerow(category.values())View Code 以上4个案例都只贴出了爬虫主程序脚本因篇幅原因所以item、pipeline和settings等脚本未贴出可参考上面案例进行编写。六**、Scrapy发送post请求** **问题**在之前代码中我们从来没有手动的对start_urls列表中存储的起始url进行过请求的发送但是起始url的确是进行了请求的发送那这是如何实现的呢 **解答**其实是因为爬虫文件中的爬虫类继承到了Spider父类中的start_requestsself这个方法该方法就可以对start_urls列表中的url发起请求 def start\_requests(self):for u in self.start\_urls:yield scrapy.Request(urlu,callbackself.parse) **注意****该方法默认的实现是对起始的url发起get请求如果想发起post请求则需要子类重写该方法。不过**一般情况下不用scrapy发post请求用request模块。例爬取百度翻译 # -\*- coding: utf-8 -\*- import scrapyclass PostSpider(scrapy.Spider):name \ post# allowed\_domains \[www.xxx.com\]start\_urls \[https://fanyi.baidu.com/sug\]def start\_requests(self):data \ { # post请求参数kw:dog}for url in self.start\_urls:yield **scrapy.FormRequest**(urlurl,formdatadata,callbackself.parse) # 发送post请求def parse(self, response):print(response.text)七、设置日志等级 - 在使用scrapy crawl spiderFileName运行程序时在终端里打印输出的就是scrapy的日志信息。 - 日志信息的种类 ERROR 一般错误 WARNING : 警告 INFO : 一般的信息 DEBUG 调试信息 - 设置日志信息指定输出在settings.py配置文件中加入 LOG\_LEVEL ‘指定日志信息种类’即可。LOG\_FILE log.txt则表示将日志信息写入到指定文件中进行存储。其他常用设置 BOT\_NAME 默认“scrapybot”使用startproject命令创建项目时其被自动赋值CONCURRENT\_ITEMS 默认为100Item Process(即Item Pipeline)同时处理每个response的item时最大值CONCURRENT\_REQUEST 默认为16scrapy downloader并发请求concurrent requests的最大值LOG\_ENABLED 默认为True,是否启用loggingDEFAULT\_REQUEST\_HEADERS 默认如下{Accept: text/html,application/xhtmlxml,application/xml;q0.9,\*/\*;q0.8, Accept-Language: en,} scrapy http request使用的默认headerLOG\_ENCODING 默认utt\-8logging中使用的编码LOG\_LEVEL 默认“DEBUG”log中最低级别可选级别有CRITICAL,ERROR,WARNING,DEBUGUSER\_AGENT 默认“Scrapy/VERSION(....)”爬取的默认User-Agent,除非被覆盖 COOKIES\_ENABLED\False禁用cookies八、同时运行多个爬虫实际开发中通常在同一个项目里会有多个爬虫多个爬虫的时候是怎么将他们运行起来呢运行单个爬虫 import sys from scrapy.cmdline import executeif \_\_name\_\_ \_\_main\_\_:execute(\[scrapy,crawl,maitian,\--nolog\])然后运行py文件即可运行名为‘maitian‘的爬虫同时运行多个爬虫步骤如下 \- 在spiders同级创建任意目录如commands \- 在其中创建 crawlall.py 文件此处文件名就是自定义的命令 \- 在settings.py 中添加配置 COMMANDS\_MODULE 项目名称.目录名称 \- 在项目目录执行命令scrapy crawlall crawlall.py代码1 from scrapy.commands import ScrapyCommand2 from scrapy.utils.project import get\_project\_settings3 4 class Command(ScrapyCommand):5 6 requires\_project True7 8 def syntax(self):9 return \[options\] 10 11 def short\_desc(self): 12 return Runs all of the spiders 13 14 def run(self, args, opts): 15 spider\_list self.crawler\_process.spiders.list() 16 for name in spider\_list: 17 self.crawler\_process.crawl(name, \*\*opts.\_\_dict\_\_) 18 self.crawler\_process.start()如有侵权请联系删除。

查看全文

http://www.hkea.cn/news/14472247/