9. Performance Comparison of Regular Expressions, BeautifulSoup, and lxml: A Worked Example

橄榄的世界
2018.02.18 16:16 · Word count: 153

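Target URL: https://www.qiushibaike.com/text/ (pages 1 to 13, as requested in the code below)
Fields scraped: user ID, joke text, number of laughs, number of comments
Parsing methods: regular expressions, BeautifulSoup, and lxml
Performance comparison: total running time of each method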

import requests
import re
from bs4 import BeautifulSoup
from lxml import etree
import time

## Regular expressions
def re_info(r):
    ids = re.findall("<h2>(.*?)</h2>",r.text,re.S)       
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>',r.text,re.S)
    laughs = re.findall('<span class="stats-vote">.*?<i class="number">(.*?)</i>',r.text,re.S)
    comments = re.findall('<span class="stats-comments">.*?<i class="number">(.*?)</i>',r.text,re.S)
    return [ids,contents,laughs,comments]

## BeautifulSoup
def bs4_info(r):
    soup = BeautifulSoup(r.text,"lxml")
    infos = soup.select("div.article")
    results = []
    for info in infos:
        user_id = info.select("h2")[0].text.strip()
        content = info.select("div.content")[0].text.strip()
        laugh = info.select("span.stats-vote i")[0].text
        comment = info.select("span.stats-comments i")[0].text
        results.append([user_id,content,laugh,comment])
    return results  # one [user_id, content, laugh, comment] row per post
    
## lxml
def lxml_info(r):
    html = etree.HTML(r.text)
    infos = html.xpath('//div[starts-with(@class,"article block untagged mb15")]')
    results = []
    for info in infos:
        user_id = info.xpath('div[1]//h2/text()')[0]
        # An XPath copied from the browser needs the trailing /span step added by hand
        content = info.xpath('a[1]/div/span/text()')[0].strip()
        laugh = info.xpath('div[2]/span[1]/i/text()')[0]
        comment = info.xpath('div[2]/span[2]/a/i/text()')[0]
        results.append([user_id,content,laugh,comment])
    return results  # one [user_id, content, laugh, comment] row per post

if __name__ == "__main__":
    url_list = ["https://www.qiushibaike.com/text/page/{}/".format(i) for i in range(1,14)]
    hds = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3294.6 Safari/537.36'}
    for name,get_info in [('re',re_info),('bs4',bs4_info),('lxml',lxml_info)]:
        # Each timing covers downloading and parsing all pages
        start = time.perf_counter()
        for url in url_list:
            r = requests.get(url,headers = hds)
            get_info(r)
        stop = time.perf_counter()
        print(name,stop-start)
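
The loop above times the HTTP download together with the parsing, so network latency is mixed into every figure. As a rough sketch of how the parsing cost could be measured on its own, the snippet below runs the three approaches against an in-memory page. The markup in POST, the selectors, and the helper names (parse_re, parse_bs4, parse_lxml, time_parser) are all assumptions made for illustration; they are simplified and do not reproduce the real site's HTML.

```python
import re
import time

from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical, simplified markup for one post (not the real site's HTML).
POST = """
<div class="article block untagged mb15">
  <div class="author"><h2>user{n}</h2></div>
  <div class="content"><span>joke text {n}</span></div>
  <div class="stats">
    <span class="stats-vote"><i class="number">{n}</i></span>
    <span class="stats-comments"><i class="number">{n}</i></span>
  </div>
</div>
"""
HTML = "<html><body>" + "".join(POST.format(n=i) for i in range(200)) + "</body></html>"


def parse_re(html):
    """Extract the four fields with regular expressions, one row per post."""
    ids = re.findall(r"<h2>(.*?)</h2>", html, re.S)
    contents = re.findall(r'<div class="content">.*?<span>(.*?)</span>', html, re.S)
    laughs = re.findall(r'<span class="stats-vote">.*?<i class="number">(.*?)</i>', html, re.S)
    comments = re.findall(r'<span class="stats-comments">.*?<i class="number">(.*?)</i>', html, re.S)
    return [list(row) for row in zip(ids, contents, laughs, comments)]


def parse_bs4(html):
    """Extract the same fields with BeautifulSoup CSS selectors."""
    soup = BeautifulSoup(html, "lxml")
    rows = []
    for info in soup.select("div.article"):
        rows.append([
            info.select("h2")[0].text.strip(),
            info.select("div.content")[0].text.strip(),
            info.select("span.stats-vote i")[0].text,
            info.select("span.stats-comments i")[0].text,
        ])
    return rows


def parse_lxml(html):
    """Extract the same fields with lxml and relative XPath expressions."""
    root = etree.HTML(html)
    rows = []
    for info in root.xpath('//div[starts-with(@class, "article")]'):
        rows.append([
            info.xpath("div[1]/h2/text()")[0],
            info.xpath("div[2]/span/text()")[0].strip(),
            info.xpath("div[3]/span[1]/i/text()")[0],
            info.xpath("div[3]/span[2]/i/text()")[0],
        ])
    return rows


def time_parser(parse, html, repeat=5):
    """Return the seconds taken to run parse(html) `repeat` times."""
    start = time.perf_counter()
    for _ in range(repeat):
        parse(html)
    return time.perf_counter() - start


if __name__ == "__main__":
    for name, parse in [("re", parse_re), ("bs4", parse_bs4), ("lxml", parse_lxml)]:
        print(name, time_parser(parse, HTML))
```

Because all three functions return the same rows for the same input, any difference in the printed times comes from the parsing approach rather than from the network or from doing different amounts of work.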

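Results: regular expressions and lxml finished in similar times, while BeautifulSoup was noticeably slower, so lxml is the recommended choice when there is a lot of data to parse. Each figure includes the time spent downloading the 13 pages, so network latency accounts for part of every total.
That said, lxml seems less forgiving about paths: expressions that use "//" are more likely to go wrong, so it is safer to spell out the full path, for example: div[2]/span[1]/i/text(). One likely cause is that a path beginning with "//" is evaluated from the document root even when it is called on a sub-element, whereas ".//" keeps the search relative to the current element.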

re 2.6481516361236572
bs4 4.277244567871094
lxml 2.4631409645080566
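
The note about "//" can be illustrated with a small sketch. The two-post page below is hypothetical and exists only to show how lxml evaluates the three path styles when xpath() is called on a sub-element.

```python
from lxml import etree

# Hypothetical two-post page, used only to illustrate the XPath behaviour.
HTML = """
<html><body>
  <div class="post"><h2>alice</h2></div>
  <div class="post"><h2>bob</h2></div>
</body></html>
"""

root = etree.HTML(HTML)
posts = root.xpath('//div[@class="post"]')
second = posts[1]

absolute = second.xpath("//h2/text()")   # leading "//": searches the whole document
relative = second.xpath(".//h2/text()")  # leading ".//": searches only inside `second`
explicit = second.xpath("h2/text()")     # full relative path from `second`

print(absolute)  # ['alice', 'bob']
print(relative)  # ['bob']
print(explicit)  # ['bob']
```

Inside a per-post loop, a leading "//" therefore returns the first match on the whole page for every post, which is easy to mistake for a library problem.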