A Detailed Tutorial on Python Web Crawler Development
A web crawler is a program that automatically fetches web page content and is an important component of search engines. Thanks to its rich library ecosystem and concise syntax, Python has become a popular language for writing crawlers. This tutorial walks through the main aspects of crawler development in Python.
I. Crawler Fundamentals
1. How a Crawler Works
- Send an HTTP request to fetch the page content
- Parse the fetched content
- Store the useful data
- Follow links and continue crawling as needed (a minimal code sketch follows this list)
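The cycle above can be illustrated with a minimal sketch; the start URL is a placeholder, and the requests and BeautifulSoup libraries used here are introduced in the next part:
import requests
from bs4 import BeautifulSoup
from collections import deque
queue = deque(['https://www.example.com'])  # placeholder start URL
seen = set(queue)                           # URLs already scheduled
results = []                                # extracted data
while queue and len(results) < 10:          # small page cap for this sketch
    url = queue.popleft()
    html = requests.get(url, timeout=10).text      # 1. send the HTTP request
    soup = BeautifulSoup(html, 'html.parser')      # 2. parse the page
    title = soup.title.string if soup.title else ''
    results.append({'url': url, 'title': title})   # 3. store the data
    for a in soup.find_all('a', href=True):        # 4. follow links
        link = a['href']
        if link.startswith('http') and link not in seen:
            seen.add(link)
            queue.append(link)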
2. Legality and Ethics
- Respect the site's robots.txt rules (see the example after this list)
- Do not put excessive load on the target site
- Respect the site's copyright and data privacy
- Obtain authorization before any commercial use
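The standard library's urllib.robotparser can be used to check robots.txt before fetching a page; a small sketch with a placeholder URL and user agent:
from urllib import robotparser
rp = robotparser.RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')  # placeholder site
rp.read()
if rp.can_fetch('MyCrawler/1.0', 'https://www.example.com/some/page'):
    print('Allowed by robots.txt')
else:
    print('Disallowed by robots.txt')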
II. Basic Python Libraries for Crawling
1. The Requests Library
Used to send HTTP requests.
import requests
# GET request
response = requests.get('https://www.example.com')
print(response.status_code)  # status code
print(response.text)  # response body
# GET request with query parameters
params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('https://www.example.com', params=params)
# POST request
data = {'key': 'value'}
response = requests.post('https://www.example.com', data=data)
# Adding headers
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.example.com', headers=headers)
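In practice it also pays to set a timeout and check for HTTP errors; a brief sketch (the URL is still a placeholder):
try:
    response = requests.get('https://www.example.com', headers=headers, timeout=10)
    response.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    print(response.encoding)     # detected text encoding
except requests.RequestException as e:
    print(f'Request failed: {e}')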
2. The BeautifulSoup Library
Used to parse HTML/XML documents.
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters...</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the content of the title tag
print(soup.title.string)
# Get all p tags
for p in soup.find_all('p'):
    print(p.get_text())
# Find by class
print(soup.find_all('p', class_='title'))
# Find by attribute
print(soup.find_all(attrs={'class': 'story'}))
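BeautifulSoup also supports CSS selectors through select() and select_one(), which the project examples later in this tutorial rely on:
# CSS selectors on the html_doc parsed above
print(soup.select_one('p.title b').get_text())  # the <b> inside <p class="title">
for p in soup.select('p.story'):
    print(p.get_text(strip=True))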
III. Advanced Crawling Techniques
1. Handling Dynamically Loaded Content (Selenium)
Pages rendered by JavaScript need to be driven with a browser automation tool such as Selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
# Launch the browser (the matching browser driver must be installed)
driver = webdriver.Chrome()
# Open a page
driver.get("https://www.example.com")
# Find an element and interact with it
search_box = driver.find_element(By.NAME, "q")
search_box.send_keys("Python爬虫")
search_box.send_keys(Keys.RETURN)
# Wait for the page to load
time.sleep(3)
# Get the page source
print(driver.page_source)
# Close the browser
driver.quit()
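An explicit wait is usually more reliable than a fixed time.sleep(); a sketch using WebDriverWait (the element name "q" follows the example above and is an assumption about the target page):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://www.example.com")
# Block until the element is present, waiting at most 10 seconds
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.NAME, "q"))
)
print(element.tag_name)
driver.quit()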
2. Handling AJAX Requests
Some data is loaded via AJAX; in that case the underlying API request can often be analyzed and called directly.
import requests
import json
# Suppose this is an API endpoint found via the browser's developer tools
url = "https://api.example.com/data"
headers = {
    'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest'
}
response = requests.get(url, headers=headers)
data = json.loads(response.text)
print(data)
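APIs like this are often paginated; a hedged sketch of walking through pages (the page parameter name and the items field are assumptions about the response format):
all_items = []
for page in range(1, 4):
    resp = requests.get(url, headers=headers, params={'page': page}, timeout=10)
    payload = resp.json()             # equivalent to json.loads(resp.text)
    items = payload.get('items', [])  # assumed field name
    if not items:
        break
    all_items.extend(items)
print(len(all_items))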
3. Using Proxy IPs
Helps prevent your IP address from being banned.
import requests
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
response = requests.get('http://example.com', proxies=proxies)
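When a single proxy is not enough, a pool of proxies can be rotated per request; a minimal sketch with placeholder proxy addresses:
import random
proxy_pool = [
    'http://10.10.1.10:3128',
    'http://10.10.1.11:3128',
    'http://10.10.1.12:3128',
]
proxy = random.choice(proxy_pool)  # pick a different proxy for each request
proxies = {'http': proxy, 'https': proxy}
response = requests.get('http://example.com', proxies=proxies, timeout=10)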
IV. Data Storage
1. Storing to Files (JSON/CSV)
import json
import csv
# Store as JSON
data = {'key': 'value'}
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)
# Store as CSV
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['标题', '作者', '日期'])  # write the header row
    writer.writerow(['Python爬虫', '张三', '2023-01-01'])
2. Storing to a Database (MySQL/MongoDB)
# Storing to MySQL
import pymysql
conn = pymysql.connect(host='localhost', user='root', password='123456', database='test')
cursor = conn.cursor()
# Create the table
cursor.execute("""
CREATE TABLE IF NOT EXISTS articles (
id INT AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(255),
content TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
# Insert a row
cursor.execute("INSERT INTO articles (title, content) VALUES (%s, %s)",
               ('Python爬虫教程', '这是关于Python爬虫的详细教程'))
conn.commit()
conn.close()
# Storing to MongoDB
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client['test_db']
collection = db['articles']
# Insert a document
article = {
    'title': 'Python爬虫教程',
    'content': '这是关于Python爬虫的详细教程',
    'date': '2023-01-01'
}
collection.insert_one(article)
V. Crawler Frameworks
1. The Scrapy Framework
Scrapy is a powerful, full-featured crawling framework.
Install it: pip install scrapy
Create a project: scrapy startproject myproject
Example spider:
import scrapy
class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://www.example.com']
    def parse(self, response):
        # Extract data
        title = response.css('title::text').get()
        yield {'title': title}
        # Follow links
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, callback=self.parse)
Run the spider with: scrapy crawl example -o output.json
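Crawl politeness is configured in the project's settings.py; a few commonly used settings (the values below are only examples):
# settings.py (example values)
ROBOTSTXT_OBEY = True               # respect robots.txt
DOWNLOAD_DELAY = 1                  # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # limit parallel requests per domain
USER_AGENT = 'MyCrawler/1.0 (+https://www.example.com/contact)'  # placeholder contact URL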
VI. Anti-Crawling Measures and Countermeasures
Common anti-crawling measures and how to handle them:
- User-Agent detection - rotate the User-Agent randomly
import random
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...'
]
headers = {'User-Agent': random.choice(user_agents)}
- IP throttling - use a pool of proxy IPs
- CAPTCHAs - use a CAPTCHA-solving service or machine-learning recognition
- Login walls - keep a logged-in state with a Session/Cookie
session = requests.Session()
login_data = {'username': 'user', 'password': 'pass'}
session.post('https://example.com/login', data=login_data)
response = session.get('https://example.com/protected-page')
- Dynamic parameters - analyze how the JavaScript generates them
- Rate limiting - throttle your crawl speed
import time
time.sleep(random.uniform(1, 3))  # random delay
VII. Distributed Crawling
Scrapy-Redis can be used to run a crawl across multiple machines.
- Install it:
pip install scrapy-redis
- Edit the Scrapy project's settings.py:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://localhost:6379'
- Make the spider inherit from RedisSpider:
from scrapy_redis.spiders import RedisSpider
class MySpider(RedisSpider):
    name = 'myspider'
    redis_key = 'myspider:start_urls'
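With RedisSpider, start URLs are taken from the Redis list named by redis_key; a small sketch of seeding the crawl with the redis Python client (assuming Redis runs on localhost):
import redis
r = redis.Redis(host='localhost', port=6379)
# Push a start URL into the list the spiders are watching
r.lpush('myspider:start_urls', 'https://www.example.com')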
VIII. Crawler Optimization Tips
- Concurrent requests - use aiohttp for asynchronous crawling
import aiohttp
import asyncio
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()
async def main():
    async with aiohttp.ClientSession() as session:
        html = await fetch(session, 'http://example.com')
        print(html)
asyncio.run(main())
- Incremental crawling - keep a record of URLs already crawled (see the sketch after this list)
- Resumable crawls - persist the crawl state so a run can pick up where it left off
- De-duplication - use a Bloom filter or a similar data structure
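A simple way to get incremental crawling and basic de-duplication (short of a full Bloom filter) is to persist the set of URLs already crawled between runs; a minimal sketch with a hypothetical state file:
import json
import os
SEEN_FILE = 'seen_urls.json'  # hypothetical state file
def load_seen():
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE, encoding='utf-8') as f:
            return set(json.load(f))
    return set()
def save_seen(seen):
    with open(SEEN_FILE, 'w', encoding='utf-8') as f:
        json.dump(sorted(seen), f)
seen = load_seen()
for url in ['https://example.com/a', 'https://example.com/b']:  # placeholder URLs
    if url in seen:
        continue  # crawled in a previous run, skip it
    # ... fetch and parse url here ...
    seen.add(url)
save_seen(seen)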
IX. Hands-On Projects
Project 1: Scraping a News Site
import requests
from bs4 import BeautifulSoup
import csv
def scrape_news():
    url = 'https://news.example.com'
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    news_list = []
    for article in soup.select('.news-item'):
        title = article.select_one('.title').get_text(strip=True)
        link = article.select_one('a')['href']
        time = article.select_one('.time').get_text(strip=True)
        news_list.append([title, link, time])
    with open('news.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['标题', '链接', '时间'])
        writer.writerows(news_list)
if __name__ == '__main__':
    scrape_news()
Project 2: Scraping Product Data from an E-Commerce Site
import requests
from bs4 import BeautifulSoup
import json
import time
import random
def get_product_info(url):
    headers = {
        'User-Agent': 'Mozilla/5.0',
        'Referer': 'https://www.example.com'
    }
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        product = {
            'title': soup.select_one('.product-title').get_text(strip=True),
            'price': soup.select_one('.price').get_text(strip=True),
            'rating': soup.select_one('.rating').get_text(strip=True),
            'reviews': soup.select_one('.review-count').get_text(strip=True)
        }
        return product
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None
def scrape_products(base_url, pages=5):
    products = []
    for page in range(1, pages+1):
        print(f"Scraping page {page}...")
        # Note: each paginated URL is treated here as a single product page for simplicity
        url = f"{base_url}?page={page}"
        product = get_product_info(url)
        if product:
            products.append(product)
        # Random delay to avoid getting banned
        time.sleep(random.uniform(1, 3))
    with open('products.json', 'w', encoding='utf-8') as f:
        json.dump(products, f, ensure_ascii=False, indent=2)
if __name__ == '__main__':
    scrape_products('https://shop.example.com/products')
X. Deploying Crawlers
1. Scheduled Runs
- Windows: Task Scheduler
- Linux: crontab
# Run every day at 1:00 AM
0 1 * * * /usr/bin/python3 /path/to/your/spider.py
2. Server Deployment
- Containerize with Docker
- Deploy Scrapy projects with Scrapyd
Summary
This tutorial has covered the main aspects of Python crawler development, from basic libraries to advanced techniques, hands-on projects, and deployment. In real-world work, also keep in mind:
- Follow the target site's robots.txt rules
- Set a reasonable crawl interval to avoid burdening the site
- Handle exceptions to keep the crawler robust
- Maintain the crawler regularly so it keeps up with site redesigns
We hope this tutorial helps you master Python crawler development!