

news 2026/4/25 21:40:24

Contents:
1 Persisting data: writing to a MySQL database
① Define the structured fields
② Rewrite the spider file
③ Write the pipeline file
④ Supporting configuration (edit settings.py)
⑤ Create the database and tables in Navicat
⑥ Result

1 Persisting data: writing to a MySQL database

① Define the structured fields

Writing items.py:

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class NovelItem(scrapy.Item):
    """Fields for the information parsed from each book URL."""
    # define the fields for your item here like:
    # name = scrapy.Field()
    category = scrapy.Field()
    book_name = scrapy.Field()
    author = scrapy.Field()
    status = scrapy.Field()
    book_nums = scrapy.Field()
    description = scrapy.Field()
    c_time = scrapy.Field()
    book_url = scrapy.Field()
    catalog_url = scrapy.Field()


class ChapterItem(scrapy.Item):
    """Fields for the chapter list parsed from each novel's chapter-list page."""
    # define the fields for your item here like:
    # name = scrapy.Field()
    chapter_list = scrapy.Field()


class ContentItem(scrapy.Item):
    """Fields for the text of one chapter, parsed from that chapter's page."""
    # define the fields for your item here like:
    # name = scrapy.Field()
    content = scrapy.Field()
    chapter_url = scrapy.Field()
```

② Rewrite the spider file

Map the parsed data onto the item fields and yield each item so that it reaches the pipeline file, pipelines.py.

```python
# -*- coding: utf-8 -*-
import datetime

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import NovelItem, ChapterItem, ContentItem


class Bh3Spider(CrawlSpider):
    name = 'zh'
    allowed_domains = ['book.zongheng.com']
    start_urls = ['https://book.zongheng.com/store/c0/c0/b0/u1/p1/v0/s1/t0/u0/i1/ALL.html']

    rules = (
        # A Rule defines a crawling rule: 1. extract URLs (LinkExtractor object)
        # 2. build requests 3. say how the responses are handled.
        # Template form: Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True)
        # 1. LinkExtractor is a class defined by Scrapy. It defines how links are extracted
        #    from each crawled page; those URLs are sent to the engine as new requests, and
        #    the engine eventually hands each response to the callback named in the Rule.
        #    allow=r'Items/' is the regular expression that links must match
        #    (roughly like findall(r'Items/', response.text)).
        # 2. callback='parse_item' names the callback function.
        # 3. follow=True means the response produced from an extracted URL is passed to the
        #    callback and is also matched against every Rule in rules again.
        # process_links names a callback that post-processes the links matched by the
        # LinkExtractor.

        # Match the URL of each book
        Rule(LinkExtractor(allow=r'https://book.zongheng.com/book/\d+.html',
                           restrict_xpaths=("//div[@class='bookname']")),
             callback='parse_book', follow=True, process_links='process_booklink'),
        # Match the URL of the chapter catalog
        Rule(LinkExtractor(allow=r'https://book.zongheng.com/showchapter/\d+.html',
                           restrict_xpaths=("//div[@class='fr link-group']")),
             callback='parse_catalog', follow=True),
        # The catalog response is matched for individual chapter URLs; each of those
        # URLs produces a response that is handed to the callback
        Rule(LinkExtractor(allow=r'https://book.zongheng.com/chapter/\d+/\d+.html',
                           restrict_xpaths=("//ul[@class='chapter-list clearfix']")),
             callback='get_content', follow=False, process_links='process_chapterlink'),
        # restrict_xpaths is a LinkExtractor parameter. It narrows the region searched:
        # only URLs that match allow AND sit inside the given XPath pass the rule.
    )

    def process_booklink(self, links):
        for index, link in enumerate(links):
            # Limit the crawl to one book
            if index == 0:
                print("Limiting to one book:", link.url)
                yield link
            else:
                return

    def process_chapterlink(self, links):
        for index, link in enumerate(links):
            # Limit the crawl to 21 chapters
            if index <= 20:
                print("Limiting to 21 chapters:", link.url)
                yield link
            else:
                return

    def parse_book(self, response):
        print("Parsing book_url")
        # Word count
        book_nums = response.xpath('//div[@class="nums"]/span/i/text()').extract()[0]
        # Book title
        book_name = response.xpath('//div[@class="book-name"]/text()').extract()[0].strip()
        category = response.xpath('//div[@class="book-label"]/a/text()').extract()[1]
        author = response.xpath('//div[@class="au-name"]/a/text()').extract()[0]
        status = response.xpath('//div[@class="book-label"]/a/text()').extract()[0]
        description = "".join(
            response.xpath('//div[@class="book-dec Jbook-dec hide"]/p/text()').extract())
        c_time = datetime.datetime.now()
        book_url = response.url
        catalog_url = response.css("a").re(r"https://book.zongheng.com/showchapter/\d+.html")[0]

        item = NovelItem()
        item["category"] = category
        item["book_name"] = book_name
        item["author"] = author
        item["status"] = status
        item["book_nums"] = book_nums
        item["description"] = description
        item["c_time"] = c_time
        item["book_url"] = book_url
        item["catalog_url"] = catalog_url
        yield item

    def parse_catalog(self, response):
        print("Parsing chapter catalog", response.url)  # response.url is where the data came from
        # Note: each chapter title must stay paired with its own URL
        a_tags = response.xpath('//ul[@class="chapter-list clearfix"]/li/a')
        chapter_list = []
        for index, a in enumerate(a_tags):
            title = a.xpath("./text()").extract()[0]
            chapter_url = a.xpath("./@href").extract()[0]
            ordernum = index + 1
            c_time = datetime.datetime.now()
            catalog_url = response.url
            chapter_list.append([title, ordernum, c_time, chapter_url, catalog_url])

        item = ChapterItem()
        item["chapter_list"] = chapter_list
        yield item

    def get_content(self, response):
        content = "".join(response.xpath('//div[@class="content"]/p/text()').extract())
        chapter_url = response.url

        item = ContentItem()
        item["content"] = content
        item["chapter_url"] = chapter_url
        yield item
```

③ Write the pipeline file

In pipelines.py, storing the data in the MySQL database takes three steps:
(1) Store the novel information.
(2) Store the chapter information, excluding the chapter text. Chapter information is ordered, and the chapter text lives on a separate page that needs a new request.
(3) Update the table from step (2) with the text of each chapter.

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
import logging
from .items import NovelItem, ChapterItem, ContentItem

# Logger named after the current module; used to record errors.
logger = logging.getLogger(__name__)


class ZonghengPipeline(object):
    def open_spider(self, spider):
        # Connect to the database
        data_config = spider.settings["DATABASE_CONFIG"]
        if data_config["type"] == "mysql":
            self.conn = pymysql.connect(**data_config["config"])
            self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Write to the database
        if isinstance(item, NovelItem):
            # Write the book information
            sql = "select id from novel where book_name=%s and author=%s"
            self.cursor.execute(sql, (item["book_name"], item["author"]))
            # fetchone() returns the next row of the last query, or None if there is none
            if not self.cursor.fetchone():
                try:
                    # No id came back, so the novel is not stored yet: insert it
                    sql = ("insert into novel(category,book_name,author,status,book_nums,"
                           "description,c_time,book_url,catalog_url) "
                           "values(%s,%s,%s,%s,%s,%s,%s,%s,%s)")
                    self.cursor.execute(sql, (
                        item["category"],
                        item["book_name"],
                        item["author"],
                        item["status"],
                        item["book_nums"],
                        item["description"],
                        item["c_time"],
                        item["book_url"],
                        item["catalog_url"],
                    ))
                    self.conn.commit()
                except Exception as e:  # catch the error and log it
                    self.conn.rollback()
                    logger.warning("Novel info error! url=%s %s", item["book_url"], e)
            return item
        elif isinstance(item, ChapterItem):
            # Write the chapter information
            try:
                sql = ("insert into chapter (title,ordernum,c_time,chapter_url,catalog_url) "
                       "values(%s,%s,%s,%s,%s)")
                # item["chapter_list"] has the form
                # [(title, ordernum, c_time, chapter_url, catalog_url), ...]
                chapter_list = item["chapter_list"]
                # executemany() writes many rows in one call: executemany(sql, [(), ()])
                self.cursor.executemany(sql, chapter_list)
                self.conn.commit()
            except Exception as e:
                self.conn.rollback()
                logger.warning("Chapter info error! %s", e)
            return item
        elif isinstance(item, ContentItem):
            try:
                sql = "update chapter set content=%s where chapter_url=%s"
                content = item["content"]
                chapter_url = item["chapter_url"]
                self.cursor.execute(sql, (content, chapter_url))
                self.conn.commit()
            except Exception as e:
                self.conn.rollback()
                logger.warning("Chapter content error! url=%s %s", item["chapter_url"], e)
            return item

    def close_spider(self, spider):
        # Close the database connection
        self.cursor.close()
        self.conn.close()
```

④ Supporting configuration (edit settings.py)

Five changes are needed: first, turn off robots.txt compliance; second, enable a download delay; third, add request headers; fourth, enable the pipeline; fifth, configure the parameters for connecting to the MySQL database:

```python
DATABASE_CONFIG = {
    "type": "mysql",
    "config": {
        "host": "localhost",
        "port": 3306,
        "user": "root",
        "password": "123456",
        "db": "zongheng",
        "charset": "utf8",
    },
}
```

⑤ Create the database and tables in Navicat

1. Create the database.
2. Create the tables. Two tables are needed in total: one named novel, which stores the basic information about each novel, and one named chapter, which stores the chapters and their text.

Note: do not forget to make id auto-increment.

⑥ Result

Extra: if repeated debugging runs into problems, you will need to delete all the data in the tables and crawl again. If you delete all the rows directly in Navicat (that is, with a DELETE operation), the auto-increment id will no longer start from 1. What should you do then?
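One common way to handle the auto-increment question above is MySQL's `TRUNCATE TABLE`, which empties a table and resets its AUTO_INCREMENT counter in a single statement. This may not be the approach the original post goes on to describe. The sketch below is only an illustration: the helper names `reset_tables` and `reset_database` are hypothetical, and it assumes the two tables are named `novel` and `chapter` as in step ⑤ and have no foreign keys pointing at them.

```python
def reset_tables(cursor, tables=("chapter", "novel")):
    """Empty each table and reset its AUTO_INCREMENT counter.

    `cursor` is any DB-API style cursor with an execute(sql) method,
    for example a pymysql cursor. Returns the statements that were run.
    """
    executed = []
    for table in tables:
        # Table names cannot be sent as query parameters, so accept
        # only plain identifiers before building the statement.
        if not table.isidentifier():
            raise ValueError(f"unexpected table name: {table!r}")
        sql = f"TRUNCATE TABLE `{table}`"
        cursor.execute(sql)
        executed.append(sql)
    return executed


def reset_database(config):
    """Hypothetical wrapper: connect with pymysql and reset both tables.

    `config` is assumed to be the "config" dict from DATABASE_CONFIG.
    """
    import pymysql  # imported here so reset_tables works without pymysql

    conn = pymysql.connect(**config)
    try:
        with conn.cursor() as cursor:
            return reset_tables(cursor)
    finally:
        conn.close()
```

Calling `reset_database(DATABASE_CONFIG["config"])` before a fresh crawl would leave both tables empty with ids starting from 1 again. Keep in mind that `TRUNCATE TABLE` commits implicitly and cannot be rolled back. If the rows have already been removed with DELETE, `ALTER TABLE novel AUTO_INCREMENT = 1` is another way to reset the counter.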

