
Python Web Scraping for Beginners, with Anti-Scraping Countermeasures: Fetching Data from Scratch

零点119 Official Team, 2026-05-19

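Introduction

Have you ever wanted to pull data from a website automatically but had no idea where to start? Or written a scraper that got blocked after a few runs? This article starts from zero, walks through the basics of web scraping in Python step by step, and shows how to deal with common anti-scraping measures.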

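1. What Is a Web Scraper?

Put simply, a scraper is a program that imitates a browser: it visits web pages and extracts information from them. Think of a trip to the library: you find the shelf (the URL), open the book (request the page), and copy down the passages you need (parse the data).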

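2. Setup: Installing the Python Libraries

We need two core libraries: requests (for sending HTTP requests) and BeautifulSoup (for parsing HTML). Open a terminal or command prompt and run:

pip install requests beautifulsoup4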

3. A First Scraper: Fetching a Page Title

Let's start with the simplest task: getting the title of a web page.

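import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
# A timeout keeps the script from hanging if the server never answers
response = requests.get(url, timeout=10)
# Raise an exception on 4xx/5xx responses instead of parsing an error page
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
# soup.title is None when the page has no <title> tag
title = soup.title.string if soup.title else None
print('Page title:', title)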

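Explanation:

requests.get(url) sends a request to the site and returns a response object.
response.text is the HTML source of the page.
BeautifulSoup parses the HTML into an object you can navigate.
soup.title.string extracts the text inside the <title> tag.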
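One caveat with soup.title.string: when a page has no <title> tag, soup.title is None and the attribute access raises AttributeError. Here is a minimal sketch of a more defensive version, run against inline HTML strings so it works without a network connection. The helper name extract_title and both sample pages are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the page title, or None when the page has no <title> tag."""
    soup = BeautifulSoup(html, 'html.parser')
    if soup.title is None:
        return None
    # get_text() also copes with nested markup inside the tag,
    # a case where .string would return None.
    return soup.title.get_text(strip=True)


# Two made-up pages: one with a title, one without.
with_title = '<html><head><title> Example Domain </title></head><body></body></html>'
without_title = '<html><body><p>No title here</p></body></html>'

print(extract_title(with_title))
print(extract_title(without_title))
```

Returning None rather than raising lets the calling code decide whether a missing title is an error or just something to skip.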

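4. Imitating a Browser: Setting the User-Agent

Many sites check the User-Agent request header and refuse the request if they see Python's default identifier. To get past that check, we present ourselves as a browser.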

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers)

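The request now carries a Chrome User-Agent, so a site that only checks this header will treat it as a normal browser visit. Sites that inspect other signals may still detect the script.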
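To see what actually changes, you can build a request without sending it and inspect its headers. The sketch below compares the default identifier that requests uses with the custom one. Nothing is sent over the network, and the variable names are only illustrative.

```python
import requests

url = 'https://example.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# The identifier requests sends when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()

# Build the request without sending it, then look at the outgoing header.
prepared = requests.Request('GET', url, headers=headers).prepare()
sent_ua = prepared.headers['User-Agent']

print('default:', default_ua)
print('custom: ', sent_ua)
```

The default value starts with python-requests/ followed by the library version, which is exactly the kind of string a simple User-Agent filter looks for.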

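5. Handling Dynamic Content: Waiting and Rendering

Some pages load their content dynamically with JavaScript, so the raw HTML returned by requests may not contain the data. A requests Session does not help here, because requests never executes JavaScript. The options are to call the underlying data API directly, if the page has one, or to drive a real browser with Selenium.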

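5.1 Using Selenium

pip install selenium

Download the driver that matches your browser (for example ChromeDriver); Selenium 4.6 and later can usually fetch it automatically. Then: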

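from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()  # needs a ChromeDriver that Selenium can locate
try:
    driver.get(url)
    time.sleep(3)  # crude fixed wait for the page to load
    # 'dynamic-content' is the id of the element to read
    content = driver.find_element(By.ID, 'dynamic-content').text
    print(content)
finally:
    driver.quit()  # close the browser even if a step above fails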

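5.2 Using Explicit Waits

A cleaner approach than a fixed sleep is an explicit wait, which returns as soon as the element appears and raises a timeout error if it never does: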

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
element = wait.until(EC.presence_of_element_located((By.ID, 'dynamic-content')))

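6. Anti-Scraping in Practice: Common Measures and Countermeasures

6.1 IP Rate Limiting

A site may limit how often a single IP address can access it. Countermeasures:

Use a proxy IP pool: rotate through different IPs.
Add a delay between requests: time.sleep(random.uniform(1, 3)).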

import random
import time

time.sleep(random.uniform(1, 3))  # wait a random 1 to 3 seconds
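The two countermeasures above can be combined. The following is a rough sketch, not a production proxy pool: it rotates through a fixed list with itertools.cycle and computes a randomized delay that grows with each retry. The proxy addresses and the helper names next_proxy and backoff_delay are hypothetical. The returned dict has the shape that the proxies argument of requests.get accepts.

```python
import itertools
import random

# Hypothetical proxy addresses; replace them with proxies you are allowed to use.
PROXIES = [
    'http://10.0.0.1:8080',
    'http://10.0.0.2:8080',
    'http://10.0.0.3:8080',
]
_proxy_cycle = itertools.cycle(PROXIES)


def next_proxy():
    """Return the next proxy as a dict mapping URL scheme to proxy URL."""
    proxy = next(_proxy_cycle)
    return {'http': proxy, 'https': proxy}


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Seconds to wait before retry number `attempt` (0-based).

    The base delay doubles with each attempt, is multiplied by a random
    factor between 1 and 3, and never exceeds `cap`.
    """
    return min(cap, base * (2 ** attempt) * random.uniform(1, 3))


# Round-robin: the fourth call wraps around to the first proxy.
used = [next_proxy()['http'] for _ in range(4)]
print(used)
print(round(backoff_delay(0), 2), round(backoff_delay(2), 2))
```

In a real loop you would pass the result as requests.get(url, proxies=next_proxy(), timeout=10) and call time.sleep(backoff_delay(attempt)) after a failed or rate-limited response.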

6.2 CAPTCHAs

When you run into a CAPTCHA, you can use a third-party CAPTCHA-solving service (such as 超级鹰, Chaojiying) or try OCR on simple ones.

6.3 Cookies and Sessions

Some sites only serve data to logged-in users. Use requests.Session to keep the login state across requests:

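session = requests.Session()
login_data = {'username': 'your_username', 'password': 'your_password'}
# login_url and target_url are placeholders for the site's real URLs
session.post(login_url, data=login_data)  # login cookies are stored on the session
response = session.get(target_url)  # and sent automatically here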
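What "keeping the session" means in practice is that cookies stored on the Session object are attached to every later request. The sketch below shows this without contacting any server: it places a cookie on the session by hand, standing in for what a login response would do, and then prepares a follow-up request. The cookie name, value, and URL are invented for illustration.

```python
import requests

session = requests.Session()

# Stand-in for a successful login: store a cookie on the session by hand.
# The cookie name and value are made up.
session.cookies.set('sessionid', 'abc123', domain='example.com', path='/')

# Prepare (but do not send) a follow-up request through the same session.
request = requests.Request('GET', 'https://example.com/data')
prepared = session.prepare_request(request)

# The session has merged its cookies into the outgoing headers.
cookie_header = prepared.headers.get('Cookie')
print(cookie_header)
```

A plain requests.get call would start with an empty cookie jar each time, which is why a login made that way does not carry over to the next request.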

6.4 Complete Request Headers

Besides User-Agent, some sites also expect headers such as Referer and Origin:

headers = {
    'User-Agent': '...',
    'Referer': 'https://www.example.com',
    'Origin': 'https://www.example.com'
}

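6.5 JavaScript Challenges

Some sites sit behind protection such as Cloudflare and only serve the real page after a JavaScript challenge has been executed. The cloudscraper library can be used in that case:

pip install cloudscraper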

import cloudscraper

scraper = cloudscraper.create_scraper()
response = scraper.get(url)

7. More on Parsing Data

7.1 Using CSS Selectors

BeautifulSoup supports CSS selectors:

soup.select('div.content > p')
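To make the selector syntax concrete, here is a small sketch run on an inline HTML snippet (the markup is invented for illustration). It contrasts the child combinator > with the descendant combinator, and shows what select_one returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up markup: two direct <p> children and one nested inside <blockquote>.
html = '''
<div class="content">
  <p>first</p>
  <blockquote><p>nested</p></blockquote>
  <p>second</p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# '>' matches direct children only, so the <p> inside <blockquote> is skipped.
direct = [p.get_text() for p in soup.select('div.content > p')]

# A space (descendant combinator) matches at any depth.
nested = [p.get_text() for p in soup.select('div.content p')]

# select_one returns the first match, or None when nothing matches.
missing = soup.select_one('div.sidebar')

print(direct)
print(nested)
print(missing)
```

Because select_one can return None, check the result before calling .text or get_text() on it.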

7.2 Regular Expressions

For more complex patterns, you can bring in the re module:

import re
pattern = r'\d{4}-\d{2}-\d{2}'  # matches dates such as 2026-05-19
dates = re.findall(pattern, response.text)
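One thing to watch with a pattern like this is that it also matches inside longer digit runs. The sketch below uses a made-up sample string to show the problem and one possible fix with word boundaries. It also shows how capture groups change what findall returns.

```python
import re

# Made-up sample text standing in for response.text.
text = 'Released 2024-01-15, updated 2024-03-02. Build 12345-67-89 is not a date.'

# The plain pattern also picks up "2345-67-89" from inside the build number.
naive = re.findall(r'\d{4}-\d{2}-\d{2}', text)

# \b stops the pattern from matching digits inside a longer number.
date_pattern = re.compile(r'\b(\d{4})-(\d{2})-(\d{2})\b')

# With capture groups, findall returns one tuple of strings per match.
parts = date_pattern.findall(text)

# finditer exposes the full matched text through group(0).
dates = [m.group(0) for m in date_pattern.finditer(text)]

print(naive)
print(parts)
print(dates)
```

The pattern still accepts impossible dates such as 2024-13-45. If that matters, validate the matches afterwards, for example with datetime.date.fromisoformat.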

7.3 Parsing JSON

Many sites return JSON from an API, which can be parsed directly:

import json
data = json.loads(response.text)  # response.json() does the same in one call
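Once parsed, the data is ordinary Python dicts and lists. The sketch below works on an invented payload in place of response.text, and also shows what happens when the server sends back something that is not JSON, which is common when a request has been blocked.

```python
import json

# A made-up API payload standing in for response.text.
raw = ('{"total": 2, "items": ['
       '{"title": "Movie A", "rating": 9.1}, '
       '{"title": "Movie B", "rating": null}]}')

data = json.loads(raw)
titles = [item['title'] for item in data['items']]

# JSON null becomes Python None; .get() also avoids KeyError on absent keys.
ratings = [item.get('rating') for item in data['items']]

try:
    json.loads('<html>Access denied</html>')
    parsed_ok = True
except json.JSONDecodeError:
    # Typical when an HTML block page comes back instead of JSON.
    parsed_ok = False

print(titles, ratings, parsed_ok)
```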

8. Worked Example: Scraping a Movie List

Suppose we want to scrape movie titles and ratings from a movie site.

import requests
from bs4 import BeautifulSoup

url = 'https://example-movie-site.com/top'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(response.text, 'html.parser')

movies = []
# Each .movie-item element is assumed to hold one .title and one .rating
for item in soup.select('.movie-item'):
    title = item.select_one('.title').get_text(strip=True)
    rating = item.select_one('.rating').get_text(strip=True)
    movies.append({'title': title, 'rating': rating})

print(movies)
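Real listings are rarely this tidy: an item may lack a rating, or the text may carry stray whitespace. Here is a sketch of a more tolerant parsing step, separated from the download so it can be tried on an inline sample. The function name parse_movies and the sample markup are assumptions, and the class names simply follow the example above.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second movie has no rating element.
html = '''
<ul>
  <li class="movie-item"><span class="title"> Movie A </span><span class="rating">9.1</span></li>
  <li class="movie-item"><span class="title">Movie B</span></li>
</ul>
'''


def parse_movies(html):
    """Extract title/rating dicts, tolerating items with no rating."""
    soup = BeautifulSoup(html, 'html.parser')
    movies = []
    for item in soup.select('.movie-item'):
        title_tag = item.select_one('.title')
        rating_tag = item.select_one('.rating')
        if title_tag is None:
            continue  # skip items that have no title at all
        # Convert the rating to a number when present, otherwise keep None.
        rating = float(rating_tag.get_text(strip=True)) if rating_tag else None
        movies.append({'title': title_tag.get_text(strip=True), 'rating': rating})
    return movies


movies = parse_movies(html)
print(movies)
```

Keeping the parsing in its own function also means you can save a page to disk once and refine the selectors without hitting the site again.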

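9. Summary and Things to Keep in Mind

Respect robots.txt: check the site's crawling rules.
Throttle your requests: don't put pressure on the server.
Stay within the rules: follow the site's terms of use.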
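The first two points can be checked in code with the standard library's urllib.robotparser. The sketch below parses a made-up robots.txt from a string. The rules, the crawler name MyCrawler, and the URLs are all illustrative.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real one lives at https://<site>/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) answers whether the rules allow the request.
allowed_public = parser.can_fetch('MyCrawler', 'https://example.com/top')
allowed_private = parser.can_fetch('MyCrawler', 'https://example.com/private/data')

# crawl_delay returns the requested pause in seconds, or None if unspecified.
delay = parser.crawl_delay('MyCrawler')

print(allowed_public, allowed_private, delay)
```

Against a live site you would call parser.set_url(...) followed by parser.read() instead of parse(), and then sleep for at least the returned delay between requests.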

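Web scraping is a double-edged sword, and it only delivers value when used responsibly. I hope this article helps you get started and handle real-world scraping with confidence!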