Scraping Qiushibaike Stories and Images, with a Reader UI


date: 2018-01-05 22:00:00
status: public
title: 'Scraping Qiushibaike Stories and Images, with a Reader UI'
tags: Python 3.6, MySQL, Tkinter, urllib, bs4, md5, random


Some Thoughts and the Approach

To be honest I am not a big fan of Qiushibaike, but as I said before, everyone scrapes it, so never mind that; just scrape it.

Before writing this I looked for references online. Some were rather old, from before Qiushibaike's redesign; the better ones offered a command-line interface where pressing Enter keeps reading stories. But they all seemed to ignore, deliberately or not, the stories that carry images, so I decided to build something like a small Qiushibaike client that supports reading images as well.

Along the way I also tidied up my spider boilerplate: a configuration file, random User-Agent headers, unit-test code and so on, so future spiders can be copy-pasted together.

The code splits into three parts: the UI, a wrapped database interface, and the spider. The UI code is simple: sketch a rough layout first, then write code against it, looking up any forgotten function in the manual; it goes quickly.

The database interface contains:

  • DBconnect(): connects to the database, with parameters read from the configuration file, and creates the qiushibaike table if it does not exist;
  • DBupdate(url, md5, author, fun, comment, content, img_urls=None): inserts one record (one story) into the qiushibaike table;
  • DBquery(): among all unread records (isread = 0), returns the one with the largest id;
  • DBTotal(): returns the total number of records in the qiushibaike table;
  • DuplicationCheck(md5): uses the given md5 value, the record's data fingerprint, to check whether the record already exists in the database;
  • DBdrop(): drops the qiushibaike table;
  • DBclose(): closes the connection;
  • DBtest(): unit-test code.

The database code is listed below; a few more notes on the spider first:

  • Qiushibaike's front-end markup is quite clean (compared with Baidu, at least); a quick inspection finds the structure we need. The various tab pages also share the same structure (with only minor differences, noted below or in the code), so the spider can crawl any Qiushibaike tab. The picture below shows where the structures we need are located:
qiushi-struct.jpg
  • No proxy is used while crawling; there is only a 1 s delay between fetches, plus a random User-Agent header on each request. Sometimes, however, a returned page cannot be parsed correctly; see qiushibaike.py around line 48. I have not found the cause. In my tests of the failing cases the response body did contain content, yet bs4's CSS selector found no matching object, and simply trying again succeeded, so I suspect the parser; I don't know whether switching parsers or rewriting the CSS selector would fix it. My current workaround is simply to re-fetch the current page.
  • Every story page has a unique URL. Comments are not scraped, but the md5 of this URL serves as a data fingerprint: before a story is written to the database it is checked first, and if it already exists it is skipped, which deduplicates the data. There is no Bloom filter or redis here because the data volume is tiny; crawling every page yields only a few thousand records, and with an index on the md5 column the lookup is easily fast enough. One more thing: I originally wanted to use MongoDB, but it apparently ships no 32-bit installer, so I dropped the idea.
  • Another point worth mentioning is grabbing the "next page" link. Under the "热门" (hot) tab the stories are a fixed 13 pages, and on page 13 the usual "下一页" (next page) label becomes "更多" (more); "24小时" (24 hours) behaves the same. The "热图" (hot images) tab is also 13 pages, but its last page has neither a "next page" nor a "more" label; "文字" (text) behaves the same. The "穿越" (throwback) tab has a variable number of pages and may even be empty, i.e. no stories that day. The "糗图" (images) tab is 35 pages, and its last page likewise has neither label; "新鲜" (new) behaves the same.
    • The page count itself is easy to handle: ignore how many pages there are and just follow the "next page" link.
    • Telling "next page", "more", and nothing apart needs a bit more logic; see the code for details.
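The fingerprint mentioned above is nothing more than the md5 hex digest of the story's URL. A minimal sketch (url_fingerprint is a helper name made up here; the real code inlines these lines in qiushibaike.py):

```python
import hashlib

def url_fingerprint(url):
    # md5 hex digest of the page URL, used as the dedup fingerprint;
    # it is compared against the indexed url_md5 column before each insert
    m = hashlib.md5()
    m.update(url.encode('utf-8'))
    return m.hexdigest()

fp = url_fingerprint('https://www.qiushibaike.com/')
```

Because the digest is deterministic, the same story URL always maps to the same 32-character fingerprint, so a unique index on that column is all the dedup machinery needed.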

Screenshots

While crawling:

qiushi-spider.jpg

Data in the table; you can see some records have an image and some do not:

qiushi-db.jpg

A story without an image:

qiushi-nopic.jpg

A story with an image:

qiushi-pic.jpg

Source Code

configure.py

# DB
DB_HOST = '192.168.153.131'
DB_PORT = 3306
DB_DBNAME = 'spider'
DB_USER = 'root'
DB_PASSWORD = '123123'
DB_CHARSET = 'utf8mb4'

# User-Agents
FakeUserAgents = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12",
    "Opera/9.27 (Windows NT 5.2; U; zh-cn)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Version/3.1 Safari/525.13",
    "Mozilla/5.0 (iPhone; U; CPU like Mac OS X) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/4A93 ",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 ",
    "Mozilla/5.0 (Linux; U; Android 3.2; ja-jp; F-01D Build/F0001) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13 ",
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; ja-jp) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7",
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_2_1 like Mac OS X; da-dk) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5 ",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_6; en-US) AppleWebKit/530.9 (KHTML, like Gecko) Chrome/ Safari/530.9 ",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Ubuntu/11.10 Chromium/27.0.1453.93 Chrome/27.0.1453.93 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36",
    "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Mobile Safari/537.36"
]
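For reference, the list above is consumed by attaching one random entry per request. A small sketch of that step, assuming plain urllib as in qiushibaike.py (build_request and the two dummy agent strings are made up for illustration):

```python
import random
from urllib import request

def build_request(url, user_agents):
    # one randomly chosen User-Agent per request, so consecutive
    # fetches do not present an identical browser fingerprint
    req = request.Request(url)
    req.add_header('User-Agent', random.choice(user_agents))
    return req

req = build_request('https://www.qiushibaike.com/', ['agent-a', 'agent-b'])
```

The function only builds the request object; fetching is still a separate `urlopen(req)` call, so this step can be tested offline.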

db.py

import pymysql.cursors
import configure

conn = None

# two thin execution wrappers, one for DML and one for DQL
def __DMLExecutionMod(sql):
    global conn

    try:
        with conn.cursor() as cursor:
            cursor.execute(sql)
        conn.commit()
    except Exception as e:
        conn.rollback()
        print("DB Exception: %s" % e)

def __DQLExecutionMod(sql):
    global conn

    res = None
    try:
        with conn.cursor() as cursor:
            cursor.execute(sql)
            res = cursor.fetchall()
        conn.commit()
    except Exception as e:
        conn.rollback()
        print("DB Exception: %s" % e)

    return res

# Connect
def DBconnect():
    global conn

    config = {
        'host':configure.DB_HOST,
        'port':configure.DB_PORT,
        'user':configure.DB_USER,
        'password':configure.DB_PASSWORD,
        'db':configure.DB_DBNAME,
        'charset':configure.DB_CHARSET,
        'cursorclass':pymysql.cursors.DictCursor,
        }

    if conn is None:
        conn = pymysql.connect(**config)

    # init table
    sql = "CREATE TABLE IF NOT EXISTS `qiushibaike`  (\
            `id` int(11) NOT NULL AUTO_INCREMENT,\
            `isread` int(11) NULL DEFAULT 0,\
            `url` varchar(255) CHARACTER SET latin1 COLLATE latin1_swedish_ci NULL DEFAULT NULL COMMENT 'url_md5 = md5(url)',\
            `url_md5` binary(64) NOT NULL,\
            `author` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,\
            `fun` int(255) NULL DEFAULT NULL,\
            `comment` int(255) NULL DEFAULT NULL,\
            `content` varchar(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,\
            `img_url` varchar(500) CHARACTER SET latin1 COLLATE latin1_swedish_ci NULL DEFAULT NULL,\
            PRIMARY KEY (`id`) USING BTREE,\
            UNIQUE INDEX `idx_id`(`id`) USING BTREE,\
            UNIQUE INDEX `idx_url_md5`(`url_md5`) USING BTREE\
            ) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Compact;\
        "   

    __DMLExecutionMod(sql)

# Add ONE record into the table
def DBupdate(url, md5, author, fun, comment, content, img_urls=None):
    global conn

    if img_urls == None:
        img_urls = 'null'
    else:
        img_urls = "'" + img_urls + "'"

    sql = "INSERT INTO `qiushibaike`\
            (`url`, `url_md5`, `author`, `fun`, `comment`, `content`, `img_url`)\
            VALUES\
            ('{0:s}', HEX('{1:s}'), '{2:s}', {3:d}, {4:d}, '{5:s}', \
            {6:s});".format(url, md5, author, fun, comment, content, img_urls).replace('    ', '')

    __DMLExecutionMod(sql)

    return True

# Retrieve the unread record with the largest id, then mark it as read
def DBquery():
    global conn

    sql = "SELECT `id`, `url`, `author`, `fun`, `comment`, `content`, `img_url`\
                FROM `qiushibaike` WHERE isread = 0 \
                ORDER BY `id` DESC LIMIT 1;".replace('  ', '')

    res = __DQLExecutionMod(sql)

    # nothing unread: return the empty result as-is
    if not res:
        return res

    sql = "UPDATE `qiushibaike` SET isread = 1 WHERE id = {0:d};".format(res[0]['id'])
    __DMLExecutionMod(sql)

    return res

# total number of records
def DBTotal():
    global conn
    sql = "SELECT count(*) as `total` FROM `qiushibaike`;"

    res = __DQLExecutionMod(sql)

    return res[0]['total']

# duplication check
def DuplicationCheck(md5):
    global conn
    sql = "SELECT count(*) AS `num` FROM `qiushibaike` WHERE url_md5 = HEX('{0:s}');".format(md5)

    res = __DQLExecutionMod(sql)

    if res[0]['num']:   
        return True
    else:
        return False

# Drop this table
def DBdrop():
    global conn
    __DMLExecutionMod("DROP TABLE `qiushibaike`;")

    return True

# close
def DBclose():
    global conn
    if conn is not None:
        conn.close()

def DBtest():
    DBconnect()

    assert True == DBupdate('http://www.google.com', 'ed646a3334ca891fd3467db131372140', 'ethan', 12, 13, 'aaaa', None), 'update fail - 1'
    assert True == DBupdate('http://www.google.com', 'ed646a3334ca891fd3467db131372141', 'ethan', 12, 14, 'aaaa', 'http://a;http://b;'), 'update fail - 2'
    assert True == DBupdate('http://www.google.com', 'ed646a3334ca891fd3467db131372142', 'ethan', 12, 15, 'aaaa', None), 'update fail - 3'

    res = DBquery()
    assert 1 == len(res), 'query fail - 11'
    assert 15 == res[0]['comment'], 'query fail - 12'

    res = DBquery()
    assert 1 == len(res), 'query fail - 21'
    assert 14 == res[0]['comment'], 'query fail - 22'

    assert 3 == DBTotal(), 'query fail - 31'

    assert True == DuplicationCheck('ed646a3334ca891fd3467db131372142'), 'duplicate fail - 1'
    assert False == DuplicationCheck('11111111111111111111111111111111'), 'duplicate fail - 2'

    assert True == DBdrop(), 'drop fail'
    DBclose()

# test
if __name__ == '__main__':
    DBtest()
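One caveat on db.py as written: the SQL is assembled with str.format, so a story whose content contains a single quote will break the INSERT. A safer variant hands the values to the driver separately and lets pymysql escape them. A sketch under that assumption (build_insert is a helper invented here, not part of db.py; the result would be executed as cursor.execute(sql, params)):

```python
def build_insert(url, md5, author, fun, comment, content, img_urls=None):
    # placeholders instead of string interpolation: the driver escapes each
    # value, so quotes inside `content` can no longer break the statement
    sql = ("INSERT INTO `qiushibaike` "
           "(`url`, `url_md5`, `author`, `fun`, `comment`, `content`, `img_url`) "
           "VALUES (%s, HEX(%s), %s, %s, %s, %s, %s);")
    return sql, (url, md5, author, fun, comment, content, img_urls)

sql, params = build_insert('http://example.com', 'deadbeef', "o'brien", 1, 2, "it's fine")
```

Passing None for img_urls also removes the manual 'null'-vs-quoted-string juggling in DBupdate, since the driver converts None to SQL NULL itself.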

ui.py

import tkinter as tk
import tkinter.messagebox
import webbrowser
from tkinter import END
from PIL import Image, ImageTk
import urllib.request

import db as datasourse
import qiushibaike as qb

def init_ui():
    root = tk.Tk()
    root.title('Qiushibaike Personal Reader')
    width = 600
    height = 440
    screenwidth = root.winfo_screenwidth()  
    screenheight = root.winfo_screenheight()  
    size = '%dx%d+%d+%d' % (width, height, (screenwidth - width)/3, (screenheight - height)/3)
    root.geometry(size)

    # labels: author, funny count / comment count, and source URL
    lf_content = tk.LabelFrame(root, width=580, height=350)  
    lf_content.grid(row=0, column=0, sticky='w',padx=10, pady=10, columnspan=3)

    lstr_author = tk.StringVar()
    lstr_author.set("Author: ")
    lstr_fun_comment = tk.StringVar()
    lstr_fun_comment.set("0 funny 0 comments")
    lstr_url = tk.StringVar()
    lstr_url.set("Source: ")
    lstr_url_val = tk.StringVar()
    global href
    href = ""

    label_author = tk.Label(lf_content,
        textvariable = lstr_author,
        width= 24, 
        height = 1,
        font = ('Microsoft YaHei', 12),
        anchor='w'
        )
    label_author.place(x=5, y=2)

    label_fun_comment = tk.Label(lf_content,
        textvariable = lstr_fun_comment,
        width= 24, 
        height = 1,
        font = ('Microsoft YaHei', 8),
        anchor='w'
        )
    label_fun_comment.place(x=5, y=30)

    label_url = tk.Label(lf_content,
        textvariable = lstr_url,
        width= 48, 
        height = 1,
        font = ('Microsoft YaHei', 10),
        anchor='w'
        )
    label_url.place(x=5, y=52)

    # make the URL a clickable hyperlink
    def callback(event):
        global href
        webbrowser.open_new(href)

    label_url_val = tk.Label(lf_content,
        textvariable = lstr_url_val,
        fg='blue',
        cursor='hand2',
        width= 48, 
        height = 1,
        font = ('Microsoft YaHei', 10),
        anchor='w'
        )
    label_url_val.place(x=55, y=52)
    label_url_val.bind("<Button-1>", callback)

    # text widget
    textbox = tk.Text(lf_content, 
        width=62,
        height=12,
        relief='solid',
        font = ('Microsoft YaHei', 12),
        #state = 'disabled'
    )
    textbox.place(x=5,y=80)     

    # run one crawl cycle
    def button_spider_click():
        count = qb.OneCircleSpider()
        tk.messagebox.showinfo(title='HI', message='Fetched {0:d} new record(s) this time.'.format(count))

    # fetch one record and render it
    def button_luck_click():
        if 0 == datasourse.DBTotal():
            tk.messagebox.showinfo(title='HI', message='You have read all the stories; go fetch some more!')
            return

        # 解析
        record = datasourse.DBquery()[0]
        lstr_author.set("Author: {0:s}".format(record['author']))
        lstr_fun_comment.set("{0:d} funny {1:d} comments".format(record['fun'], record['comment']))
        lstr_url_val.set(record['url'])
        global href
        href = record['url']

        # a disabled Text widget rejects inserts,
        # so switch it to normal, update the content, then disable it again
        textbox.configure(state='normal')
        existed_text = textbox.get("1.0", END).strip()
        if existed_text:
            textbox.delete("1.0", END)
        textbox.insert('insert', record['content'])
        textbox.configure(state='disabled')

        # disable the image button first no matter what;
        # if this record has an image, download it and re-enable the button
        button_img.configure(state='disabled')
        if record['img_url']:
            urllib.request.urlretrieve(record['img_url'],filename='test.jpg')
            button_img.configure(state='normal')

    def button_img_click():
        # open a new window sized to match the image
        img_window = tk.Toplevel(root)
        img_window.title("Image Viewer")
        image = Image.open("test.jpg")
        # why +4? to keep the border symmetric
        img_window_size = '%dx%d+%d+%d' % (image.width + 4, image.height + 4, (screenwidth - image.width)/2, (screenheight - image.height)/2)
        img_window.geometry(img_window_size)
                        
        img = ImageTk.PhotoImage(image)
        canvas = tk.Canvas(img_window, width = image.width ,height = image.height, bg = 'grey')
        # the first two arguments of create_image() are the coordinates of the image's **center**
        canvas.create_image(image.width//2, image.height//2, image=img)
        canvas.place(x=0,y=0)

        img_window.mainloop()

    # three buttons
    button_spider = tk.Button(root,
        text='Fetch More',
        width=10,
        height=2,
        font = ('Microsoft YaHei', 12),
        command=button_spider_click
        )
    button_spider.grid(row=1, column=0, sticky='we',padx=10)

    button_img = tk.Button(root,
        text='Show Image',
        width=10,
        height=2,
        font = ('Microsoft YaHei', 12),
        state = 'disabled',
        command=button_img_click
        )
    button_img.grid(row=1, column=1, sticky='we',padx=10)

    button_luck = tk.Button(root,
        text="I'm Feeling Lucky",
        width=10,
        height=2,
        font = ('Microsoft YaHei', 12),
        command=button_luck_click
        )
    button_luck.grid(row=1, column=2, sticky='we',padx=10)

    root.mainloop()

if __name__ == '__main__':
    datasourse.DBconnect()
    init_ui()
    datasourse.DBclose()

qiushibaike.py

# Standard Lib
import urllib
import hashlib
import time
from urllib import request
from urllib import error
from bs4 import BeautifulSoup
from random import choice

# User Lib
import db
import ui
import configure

# Any of these tab URLs can be crawled, since they all share the same structure.
# In order: hot, 24 hours, hot images, text, throwback, images, new
TargetURLs = ['https://www.qiushibaike.com/',
            'https://www.qiushibaike.com/imgrank/',
            'https://www.qiushibaike.com/hot/',
            'https://www.qiushibaike.com/text/',
            'https://www.qiushibaike.com/history/',
            'https://www.qiushibaike.com/pic/',
            'https://www.qiushibaike.com/textnew/'
        ]

Domain = 'https://www.qiushibaike.com'

def OnepageSpider(myTargetURL=None):
    # pick a random tab on each call; putting choice() in the default
    # argument would evaluate it only once, at definition time
    if myTargetURL is None:
        myTargetURL = choice(TargetURLs)
    print("Start to spider: {0:s}".format(myTargetURL))
    try:
        # build the request with a random User-Agent
        req = request.Request(myTargetURL)
        req.add_header("User-Agent", choice(configure.FakeUserAgents))
        response = request.urlopen(req)
        if response.getcode() != 200:
            print("HTTP Request Code: {0:d}".format(response.getcode()))
            return myTargetURL, 0
        html = response.read()
    except error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
        # no page to parse; let the caller retry this URL
        return myTargetURL, 0

    # parse with bs4
    soup = BeautifulSoup(html, 'lxml')

    # This occasionally fails, but an immediate retry succeeds, hence the check.
    # I haven't found the cause; it doesn't look like network flakiness,
    # and may be related to the CSS selector expression.
    if soup.select('div.col1'): 
        results = soup.select('div.col1')[0].select("div.article")
    else:
        print ("SOMETHING IS WRONG, TRY AGAIN LATER.")
        return myTargetURL, 0

    # parse each record and write it to the DB
    count = 0
    for res in results:
        # first parse the URL and check whether it is already in the database
        url = Domain + res.find_all('a', class_='contentHerf')[0].get('href')
        # md5
        m = hashlib.md5()
        m.update(url.encode('utf-8'))
        url_md5 = m.hexdigest()
        
        if db.DuplicationCheck(url_md5):
            continue

        # not in the database yet; parse the remaining fields
        author = res.find('h2').get_text().strip()
        
        stat = res.find_all('i', class_='number')
        
        # a comment count of 0 simply isn't rendered on the page;
        # I haven't seen a post with 0 "funny" votes, but handle that too
        if len(stat) == 0:
            fun = comment = 0
        elif len(stat) == 1:
            fun = stat[0].get_text()
            comment = 0
        else:
            fun = stat[0].get_text()
            comment = stat[1].get_text()

        content = res.select("div.content span")[0].get_text().strip()

        if res.select("div.thumb"):
            img_urls = "https:" + res.select("div.thumb img")[0].get('src')
        else:
            img_urls = None


        if True == db.DBupdate(url, url_md5, author, int(fun), int(comment), content, img_urls):
            count += 1

    
    # parse the next-page URL and return it
    next = soup.select('div.col1 ul.pagination li')[-1].a
    # Written this way because on some tabs the last page carries a "more"
    # label instead of "next page", and on others it is empty;
    # the extra check lets the same code handle every tab.
    if next and next.span.get_text().strip() == '下一页':
        next_url = Domain + next.get('href')
    else:
        next_url = None

    return next_url, count

# Pick one of the URLs and crawl a full round,
# i.e. follow "next page" until there is none; usually 13 pages, one tab has 35
def OneCircleSpider():
    total = 0

    next_url, num = OnepageSpider()
    print ("Spider One Page. Add {0:d} record(s)".format(num))
    total += num
    
    while next_url:
        next_url, num = OnepageSpider(next_url)
        total += num
        print ("Spider One Page. Add {0:d} record(s)".format(num))
        time.sleep(1)
    
    print ("Add {0:d} record(s) in this circle".format(total))
    
    return total

def main():
    db.DBconnect()
    ui.init_ui()
    db.DBclose()
    
if __name__ == '__main__':
    main()
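The "next page / more / nothing" branching in OnepageSpider boils down to one small predicate. Factored out, it can be sketched and tested without a live page (next_page_url is a name invented here, and the /8hr/page/2/ path is only illustrative; the real code inlines the check):

```python
def next_page_url(domain, last_li_text, last_li_href):
    # last_li_text is the stripped text of the last pagination entry's <span>,
    # or None when that entry holds no link at all (the empty-tab case)
    if last_li_text == '下一页':   # "next page": keep crawling
        return domain + last_li_href
    return None                    # "更多" ("more") or nothing: stop

url = next_page_url('https://www.qiushibaike.com', '下一页', '/8hr/page/2/')
```

Returning None in both terminating cases is what lets OneCircleSpider use a single `while next_url:` loop for every tab.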
