A Detailed Tutorial on Python Web Crawler Development

A web crawler is a program that automatically fetches web content and is a core component of search engines. Thanks to its rich library ecosystem and concise syntax, Python has become a popular language for writing crawlers. This tutorial walks through the main aspects of Python crawler development.

I. Crawler Fundamentals

1. How a Crawler Works

  1. Send an HTTP request to fetch the page
  2. Parse the fetched content
  3. Store the useful data
  4. Follow links and continue crawling as needed (a minimal end-to-end sketch follows this list)
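
Below is a minimal, illustrative sketch of these four steps using requests and BeautifulSoup (both introduced in section II); example.com is a placeholder for a real site and the 5-page cap is arbitrary:

import requests
from bs4 import BeautifulSoup

to_visit = ['https://www.example.com']   # start URL (placeholder)
seen, titles = set(), []
while to_visit and len(seen) < 5:        # arbitrary cap for the sketch
    url = to_visit.pop()
    if url in seen:
        continue
    seen.add(url)
    html = requests.get(url, timeout=10).text                 # 1. send the HTTP request
    soup = BeautifulSoup(html, 'html.parser')                  # 2. parse the content
    titles.append(soup.title.string if soup.title else url)    # 3. store useful data
    to_visit.extend(a['href'] for a in soup.find_all('a', href=True)
                    if a['href'].startswith('http'))           # 4. follow links
print(titles)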

2. Legality and Ethics

  • Respect the site's robots.txt rules (see the robotparser sketch after this list)
  • Do not put excessive load on the target site
  • Respect the site's copyright and data privacy
  • Obtain permission before any commercial use
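
Checking robots.txt does not require a third-party package; the standard library's urllib.robotparser handles it. A minimal sketch (the site and user-agent string are placeholders):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')   # placeholder site
rp.read()
# Ask whether our crawler may fetch a given URL
if rp.can_fetch('MyCrawler/1.0', 'https://www.example.com/some/page'):
    print('Allowed by robots.txt')
else:
    print('Disallowed by robots.txt')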

II. Core Python Crawler Libraries

1. The Requests Library

Used for sending HTTP requests:

import requests
# GET request
response = requests.get('https://www.example.com')
print(response.status_code)  # status code
print(response.text)        # response body
# GET request with query parameters
params = {'key1': 'value1', 'key2': 'value2'}
response = requests.get('https://www.example.com', params=params)
# POST request
data = {'key': 'value'}
response = requests.post('https://www.example.com', data=data)
# Custom headers
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('https://www.example.com', headers=headers)

2. The BeautifulSoup Library

Used for parsing HTML/XML documents:

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters...</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the contents of the title tag
print(soup.title.string)
# Get all p tags
for p in soup.find_all('p'):
    print(p.get_text())
# Find by class
print(soup.find_all('p', class_='title'))
# Find by attribute
print(soup.find_all(attrs={'class': 'story'}))
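
The project examples later in this tutorial use CSS selectors rather than find_all; the equivalent lookups with select()/select_one(), continuing from the soup object parsed above, look like this:

# CSS selectors are a compact alternative to find_all()
print(soup.select('p.title'))                 # every <p class="title">
print(soup.select_one('p.story').get_text())  # first <p class="story">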

III. Advanced Crawling Techniques

1. Handling Dynamically Loaded Content (Selenium)

Pages rendered by JavaScript cannot be scraped with plain HTTP requests; a browser automation tool such as Selenium is needed:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
# Start the browser (requires a matching browser driver;
# recent Selenium versions can fetch one automatically via Selenium Manager)
driver = webdriver.Chrome()
# Open the page
driver.get("https://www.example.com")
# Find an element and interact with it
search_box = driver.find_element(By.NAME, "q")
search_box.send_keys("Python crawler")
search_box.send_keys(Keys.RETURN)
# Wait for the page to load (an explicit WebDriverWait is more robust than a fixed sleep)
time.sleep(3)
# Get the rendered page source
print(driver.page_source)
# Close the browser
driver.quit()

2. Handling AJAX Requests

Some data is loaded via AJAX; in that case you can analyze and call the underlying API directly:

import requests
import json
# Suppose this is the API endpoint found with the browser's developer tools
url = "https://api.example.com/data"
headers = {
    'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest'
}
response = requests.get(url, headers=headers)
data = json.loads(response.text)
print(data)

3. Using Proxy IPs

Proxies help you avoid getting your IP banned:

import requests
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
response = requests.get('http://example.com', proxies=proxies)

IV. Data Storage

1. Saving to Files (JSON/CSV)

import json
import csv
# Save as JSON
data = {'key': 'value'}
with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False)
# Save as CSV
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Author', 'Date'])  # header row
    writer.writerow(['Python crawler', 'Zhang San', '2023-01-01'])

2. Saving to Databases (MySQL/MongoDB)

# Saving to MySQL
import pymysql
conn = pymysql.connect(host='localhost', user='root', password='123456', database='test')
cursor = conn.cursor()
# Create the table
cursor.execute("""
CREATE TABLE IF NOT EXISTS articles (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    content TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
""")
# Insert a row (parameterized query)
cursor.execute("INSERT INTO articles (title, content) VALUES (%s, %s)",
              ('Python crawler tutorial', 'A detailed tutorial on Python web crawling'))
conn.commit()
conn.close()
# Saving to MongoDB
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client['test_db']
collection = db['articles']
# Insert a document
article = {
    'title': 'Python crawler tutorial',
    'content': 'A detailed tutorial on Python web crawling',
    'date': '2023-01-01'
}
collection.insert_one(article)

V. Crawler Frameworks

1. The Scrapy Framework

Scrapy is a powerful crawling framework. Install it with pip install scrapy and create a project with scrapy startproject myproject. An example Spider:

import scrapy
class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://www.example.com']

    def parse(self, response):
        # Extract data
        title = response.css('title::text').get()
        yield {'title': title}

        # Follow links
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, callback=self.parse)

Run the spider with: scrapy crawl example -o output.json
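
Scrapy also exposes polite-crawling options in the project's settings.py; the values below are illustrative, not recommendations:

# settings.py (illustrative values)
ROBOTSTXT_OBEY = True                 # honor robots.txt (see section I)
DOWNLOAD_DELAY = 1.0                  # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # limit parallel requests per site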

VI. Anti-Crawling Measures and Countermeasures

Common anti-crawling measures and how to handle them:

  1. User-Agent detection - rotate the User-Agent at random
    import random
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...'
    ]
    headers = {'User-Agent': random.choice(user_agents)}

  2. IP limits - use a pool of proxy IPs (see the sketch after this list)
  3. CAPTCHAs - use a captcha-solving service or machine-learning recognition
  4. Login walls - keep the logged-in state with a Session/Cookie
    session = requests.Session()
    login_data = {'username': 'user', 'password': 'pass'}
    session.post('https://example.com/login', data=login_data)
    response = session.get('https://example.com/protected-page')

  5. Dynamic parameters - analyze the JavaScript that generates them
  6. Request-rate limits - throttle the crawl speed
    import time
    time.sleep(random.uniform(1, 3))  # random delay
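
For item 2, a minimal proxy-pool sketch; the proxy addresses are placeholders, and a production pool would also verify and retire dead proxies:

import random
import requests

proxy_pool = [                      # placeholder proxy addresses
    'http://10.10.1.10:3128',
    'http://10.10.1.11:3128',
    'http://10.10.1.12:3128',
]

def get_with_random_proxy(url):
    proxy = random.choice(proxy_pool)
    proxies = {'http': proxy, 'https': proxy}
    # A real crawler would retry with a different proxy on failure
    return requests.get(url, proxies=proxies, timeout=10)

response = get_with_random_proxy('http://example.com')
print(response.status_code)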
    

VII. Distributed Crawling

Scrapy-Redis can be used to build a distributed crawler:

  1. Install: pip install scrapy-redis
  2. Edit the Scrapy project's settings.py:
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    REDIS_URL = 'redis://localhost:6379'
    
  3. Make the Spider inherit from RedisSpider (start URLs are pushed into Redis as shown below):
    from scrapy_redis.spiders import RedisSpider
    class MySpider(RedisSpider):
        name = 'myspider'
        redis_key = 'myspider:start_urls'
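
A RedisSpider waits for start URLs on the Redis list named by redis_key; they can be pushed with the redis-py client (the connection details below are assumptions):

import redis

r = redis.Redis(host='localhost', port=6379)   # assumes a local Redis instance
r.lpush('myspider:start_urls', 'https://www.example.com')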
    

VIII. Crawler Optimization Tips

  1. Concurrent requests - use aiohttp for asynchronous crawling
    import aiohttp
    import asyncio
    async def fetch(session, url):
        async with session.get(url) as response:
            return await response.text()
    async def main():
        urls = ['http://example.com', 'http://example.org']
        async with aiohttp.ClientSession() as session:
            # asyncio.gather runs the requests concurrently
            pages = await asyncio.gather(*(fetch(session, u) for u in urls))
            print([len(page) for page in pages])
    asyncio.run(main())
    
  2. Incremental crawling - record URLs that have already been crawled (see the sketch after this list)
  3. Resumable crawling - persist the crawl state so a run can pick up where it stopped
  4. Deduplication - use a Bloom filter or a similar structure
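
For items 2 and 3, a minimal sketch that persists the set of already-crawled URLs to a file (the file name is arbitrary); for very large URL sets, item 4's Bloom filter trades a small false-positive rate for far less memory:

import json
import os

SEEN_FILE = 'seen_urls.json'   # arbitrary file name for the persisted state

def load_seen():
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE, encoding='utf-8') as f:
            return set(json.load(f))
    return set()

def save_seen(seen):
    with open(SEEN_FILE, 'w', encoding='utf-8') as f:
        json.dump(sorted(seen), f)

seen = load_seen()
for url in ['http://example.com/a', 'http://example.com/b']:
    if url in seen:
        continue               # already crawled in an earlier run
    # ... fetch and parse the page here ...
    seen.add(url)
save_seen(seen)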

IX. Hands-On Projects

Project 1: Scraping a News Site

import requests
from bs4 import BeautifulSoup
import csv
def scrape_news():
    url = 'https://news.example.com'
    headers = {'User-Agent': 'Mozilla/5.0'}

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    news_list = []
    for article in soup.select('.news-item'):
        title = article.select_one('.title').get_text(strip=True)
        link = article.select_one('a')['href']
        pub_time = article.select_one('.time').get_text(strip=True)
        news_list.append([title, link, pub_time])

    with open('news.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Title', 'Link', 'Time'])
        writer.writerows(news_list)
if __name__ == '__main__':
    scrape_news()

Project 2: Scraping Product Data from an E-Commerce Site

import requests
from bs4 import BeautifulSoup
import json
import time
import random
def get_product_info(url):
    headers = {
        'User-Agent': 'Mozilla/5.0',
        'Referer': 'https://www.example.com'
    }

    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, 'html.parser')

        product = {
            'title': soup.select_one('.product-title').get_text(strip=True),
            'price': soup.select_one('.price').get_text(strip=True),
            'rating': soup.select_one('.rating').get_text(strip=True),
            'reviews': soup.select_one('.review-count').get_text(strip=True)
        }

        return product

    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None
def scrape_products(base_url, pages=5):
    products = []

    for page in range(1, pages+1):
        print(f"Scraping page {page}...")
        url = f"{base_url}?page={page}"

        product = get_product_info(url)
        if product:
            products.append(product)

        # Random delay to avoid getting blocked
        time.sleep(random.uniform(1, 3))

    with open('products.json', 'w', encoding='utf-8') as f:
        json.dump(products, f, ensure_ascii=False, indent=2)
if __name__ == '__main__':
    scrape_products('https://shop.example.com/products')

X. Deploying a Crawler

1. Scheduled Runs

  • Windows: Task Scheduler
  • Linux: crontab
    # Run every day at 1:00 AM
    0 1 * * * /usr/bin/python3 /path/to/your/spider.py
    

2. Server Deployment

  • Containerize with Docker
  • Deploy Scrapy projects with Scrapyd

Summary

This tutorial has covered Python crawler development from the basic libraries to advanced techniques, hands-on projects, and deployment. In real-world work, also keep in mind:

  1. Follow the target site's robots.txt rules
  2. Choose a reasonable crawl interval so you do not burden the site
  3. Handle exceptions to keep the crawler robust
  4. Maintain the crawler regularly so it keeps up with site redesigns

We hope this tutorial helps you master Python crawler development!








