Web Crawlers
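How Crawlers Work
A crawler, also called a web crawler, is a program that fetches web page content automatically. It mimics how a person browses the web: it sends an HTTP request, receives the page source, and then parses that source to extract the data it needs.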
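The parse-and-extract step can be sketched without any network access. The snippet below is only an illustration: `SAMPLE_HTML`, `LinkExtractor`, and `extract_links` are made-up names, and the HTML string stands in for page source that a real crawler would have downloaded. It uses the standard library's `html.parser` rather than the third-party parsers used later in this post.

```python
from html.parser import HTMLParser

# Hypothetical page source, standing in for the body of an HTTP response.
SAMPLE_HTML = """
<html><body>
  <h1>Books</h1>
  <a href="/book/1">Python Basics</a>
  <a href="/book/2">Web Scraping</a>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return the links found in it."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


print(extract_links(SAMPLE_HTML))
```

In a real crawler the string passed to `extract_links` would come from an HTTP response instead of a constant.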
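HTTP Request and Response Flow
A crawler first sends an HTTP request to the target site. The request carries several pieces of information, such as the URL, the request method (GET or POST), and the request headers. After receiving the request, the server returns an HTTP response made up of a status code, response headers, and a response body (the page content).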
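To make the request and response parts concrete, here is a minimal sketch that talks to a throwaway server on localhost instead of a real site. Everything in it is an assumption made for illustration: `DemoHandler`, `run_demo`, the `demo-crawler/0.1` User-Agent, and the tiny HTML body. It uses the standard library's `urllib.request` rather than `requests` so that it runs without extra packages.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny local server so the example never touches a real website."""

    def do_GET(self):
        body = b"<html><body>hello</body></html>"
        self.send_response(200)  # status code
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)   # response body

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def run_demo():
    """Send one GET request to the local server and summarize the response."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        request = urllib.request.Request(
            url, headers={"User-Agent": "demo-crawler/0.1"}, method="GET"
        )
        # An empty ProxyHandler bypasses any proxy configured in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return {
                "method": request.get_method(),
                "status": response.status,
                "content_type": response.headers.get("Content-Type"),
                "body": response.read().decode("utf-8"),
            }
    finally:
        server.shutdown()
        server.server_close()


result = run_demo()
print(result["method"], result["status"], result["content_type"])
print(result["body"])
```

The returned dictionary mirrors the three parts of a response described above: status code, headers (here only `Content-Type`), and body.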
Common Crawler Tools
Category              Examples
Request libraries     requests, aiohttp
Parsing libraries     BeautifulSoup, lxml, PyQuery
Storage libraries     pandas, SQLite
Async libraries       asyncio, aiohttp
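As one example from the storage row, scraped rows can be written to SQLite with the standard library's `sqlite3` module. This is a rough sketch under stated assumptions: the `movies` table, its columns, the `save_movies` helper, and the sample rows are all invented for illustration and do not come from a real crawl.

```python
import sqlite3


def save_movies(conn, rows):
    """Create the table if needed and insert (title, rating, votes) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS movies ("
        "title TEXT PRIMARY KEY, rating REAL, votes INTEGER)"
    )
    # INSERT OR REPLACE keeps re-runs from duplicating rows with the same title.
    conn.executemany("INSERT OR REPLACE INTO movies VALUES (?, ?, ?)", rows)
    conn.commit()


# Made-up sample rows, standing in for scraped data.
sample_rows = [("Movie A", 9.1, 1200), ("Movie B", 8.7, 950)]

conn = sqlite3.connect(":memory:")  # pass a file path instead to persist to disk
save_movies(conn, sample_rows)
save_movies(conn, sample_rows)  # a second run does not add duplicate rows

top = conn.execute(
    "SELECT title, rating FROM movies ORDER BY rating DESC"
).fetchall()
print(top)
```

Using the title as the primary key is a simplification; a real dataset would more likely key on a URL or an ID.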
Hands-On Examples
Scraping the Douban Movie Top 250
import csv

import requests
from bs4 import BeautifulSoup

# Base request URL
url = 'https://movie.douban.com/top250'

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}


# Parse one page and write one CSV row per movie
def parse_html(html, writer):
    soup = BeautifulSoup(html, 'lxml')
    movie_list = soup.find('ol', class_='grid_view').find_all('li')
    for movie in movie_list:
        title = movie.find('div', class_='hd').find('span', class_='title').get_text()
        rating_num = movie.find('div', class_='star').find('span', class_='rating_num').get_text()
        # The last <span> in the rating block holds the number of ratings
        comment_num = movie.find('div', class_='star').find_all('span')[-1].get_text()
        writer.writerow([title, rating_num, comment_num])


# Fetch all ten pages and save the results
def save_data():
    # Open with 'w' rather than 'a' so a re-run does not append a second header row
    with open('douban_movie_top250.csv', 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(['Title', 'Rating', 'Number of ratings'])
        for i in range(10):
            # Each page lists 25 movies, so the offset advances by 25
            page_url = url + '?start=' + str(i * 25) + '&filter='
            response = requests.get(page_url, headers=headers)
            parse_html(response.text, writer)


if __name__ == '__main__':
    save_data()

Scraping Dangdang Book Listings
import csv

import requests
from lxml import etree

url = 'http://search.dangdang.com/?key=Python&act=input'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}


# Return the first XPath match, or a default when nothing matched
def first_or(values, default):
    return values[0] if values else default


# Yield one dict per book found in the search results
def parse_html(html):
    selector = etree.HTML(html)
    book_list = selector.xpath('//*[@id="search_nature_rg"]/ul/li')
    for book in book_list:
        yield {
            'title': first_or(book.xpath('a/@title'), 'Unknown title'),
            'link': first_or(book.xpath('a/@href'), 'Unknown link'),
            'price': first_or(book.xpath('p[@class="price"]/span[@class="search_now_price"]/text()'), 'Unknown price'),
            'author': first_or(book.xpath('p[@class="search_book_author"]/span[1]/a/@title'), 'Unknown author'),
            'publish_date': first_or(book.xpath('p[@class="search_book_author"]/span[2]/text()'), 'Unknown publication date'),
            'publisher': first_or(book.xpath('p[@class="search_book_author"]/span[3]/a/@title'), 'Unknown publisher'),
        }


# Fetch the search page and write the parsed books to a CSV file
def save_data():
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        with open('dangdang_books.csv', 'w', newline='', encoding='utf-8-sig') as f:
            writer = csv.writer(f)
            writer.writerow(['Title', 'Link', 'Price', 'Author', 'Publication date', 'Publisher'])
            for item in parse_html(response.text):
                writer.writerow([item['title'], item['link'], item['price'],
                                 item['author'], item['publish_date'], item['publisher']])
    else:
        print(f'Request failed, status code: {response.status_code}')


if __name__ == '__main__':
    save_data()
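The Douban example pages through its list by changing the `start` query parameter in steps of 25. As a small aside, the same page URLs could also be generated with `urllib.parse.urlencode` instead of manual string concatenation. The helper name `page_urls` and its `total` and `page_size` parameters are assumptions introduced for this sketch.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"


def page_urls(total=250, page_size=25):
    """Return one URL per results page, stepping the start offset by page_size."""
    return [
        f"{BASE_URL}?{urlencode({'start': start, 'filter': ''})}"
        for start in range(0, total, page_size)
    ]


urls = page_urls()
print(len(urls))   # number of pages to request
print(urls[0])     # first page, start=0
print(urls[-1])    # last page, start=225
```

Building the query string this way also takes care of escaping if a parameter value ever contains special characters.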