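Python Web Scraping: Hands-On with the pyspider Framework

pyspider is a fairly compact framework. Crawled results are stored directly in resultdb and can be browsed in the web UI, which makes it very convenient to use.

Enough talk, let's get hands-on.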

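Goal of this project:

Use pyspider to crawl novels from the Dingdian novel site (顶点小说网, www.x23us.com) and store them in a local MySQL database.

Approach:

The logic is simple. First crawl the URLs of the novel categories, then follow each category to collect the novel titles listed under it, then crawl each novel's chapter list, and finally fetch the content of every chapter. The fields we need are written to the database along the way.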
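The first two levels of this crawl come down to building URLs of the form class/<category>_<page>.html. Here is a minimal sketch of that step, assuming the URL pattern used in this article still holds. The helper names are hypothetical and are not part of pyspider:

```python
import re

# Listing URL prefix used in this article (assumed to be unchanged).
BASE = 'http://www.x23us.com/class/'


def category_start_urls(first_id=1, last_id=10):
    """Page 1 of every category listing, e.g. .../class/3_1.html."""
    return ['%s%d_1.html' % (BASE, i) for i in range(first_id, last_id + 1)]


def category_page_urls(current_url, first_page, last_page):
    """All listing pages of the category that current_url belongs to."""
    match = re.search(r'/class/(\d+)_\d+\.html$', current_url)
    if match is None:
        raise ValueError('not a category listing URL: %r' % current_url)
    category_id = match.group(1)
    return ['%s%s_%d.html' % (BASE, category_id, page)
            for page in range(first_page, last_page + 1)]
```

Reading the category id back out of the current URL keeps the pagination inside the category being visited, rather than always paging through category 1.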

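Steps:

1. Start pyspider with the command pyspider all

2. Create a new project

3. Enter the Handler code

The two key points here are the save argument of self.crawl, which carries data to the next callback where it is read back as response.save, and the overridden on_result, which writes each result to the local database.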
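To make the role of save concrete, here is a toy model of the callback chain. It does not use pyspider at all: FakeResponse, run and the example.com URLs are hypothetical stand-ins that only show how a dict attached to one request reaches the next callback, and how the final record is the one that gets collected.

```python
from collections import deque


class FakeResponse(object):
    """Hypothetical stand-in for a pyspider response: only url and save."""

    def __init__(self, url, save):
        self.url = url
        self.save = save


def list_books(response):
    # Pretend one book was found on the listing page.
    meta = {'booktitle': 'Some Book', 'author': 'Someone'}
    return [('http://example.com/book/1', list_chapter, meta)], None


def list_chapter(response):
    # response.save holds what list_books attached to this request.
    meta = dict(response.save, chaptertitle='Chapter 1')
    return [('http://example.com/book/1/1.html', list_content, meta)], None


def list_content(response):
    # The last callback schedules nothing and returns the finished record.
    return [], dict(response.save, content='chapter text')


def run(start_url, callback):
    """Tiny dispatcher: follows scheduled requests and collects results."""
    queue = deque([(start_url, callback, None)])
    results = []
    while queue:
        url, cb, save = queue.popleft()
        requests, result = cb(FakeResponse(url, save))
        queue.extend(requests)
        if result:  # skip callbacks that only scheduled more requests
            results.append(result)
    return results


records = run('http://example.com/class/1_1.html', list_books)
```

In the real handler, self.crawl(..., save=...) plays the role of the tuples above, and on_result plays the role of the results list.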

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-07-20 16:56:22
# Project: dingdian

import re

from pyspider.libs.base_handler import *
from pyspider.database.mysql.mysqldb import SQL


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        # Start from page 1 of each of the ten category listings.
        baseurl = 'http://www.x23us.com/class/'
        suffix = '_1.html'
        for i in range(1, 11):
            url = baseurl + str(i) + suffix
            self.crawl(url, callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # The pager gives the first and last page numbers of this category.
        first = int(response.doc('.first').text())
        total_page_num = int(response.doc('.last').text())
        # Read the category id from the current URL instead of hard-coding
        # category 1, so that every category is paginated.
        match = re.search(r'/class/(\d+)_', response.url)
        category_id = match.group(1) if match else '1'
        baseurl = 'http://www.x23us.com/class/' + category_id + '_'
        suffix = '.html'
        for index in range(first, total_page_num + 1):
            url = baseurl + str(index) + suffix
            self.crawl(url, callback=self.list_books, validate_cert=False)

    def list_books(self, response):
        # Each table row is one book; rows without a title link are skipped.
        for item in response.doc('tr').items():
            booktitle = item.find('.L').find('a').eq(1).text()
            if len(booktitle) == 0:
                continue
            author = item.find('.C').eq(0).text()
            updatetime = item.find('.C').eq(1).text()
            status = item.find('.C').eq(2).text()
            bookurl = item.find('.L').find('a').eq(1).attr('href')
            # Carry the book metadata to the next callback via save.
            savedata = {'booktitle': booktitle, 'author': author,
                        'updatetime': updatetime, 'status': status}
            self.crawl(bookurl, callback=self.list_chapter, save=savedata, validate_cert=False)

    def list_chapter(self, response):
        for item in response.doc('.L').items():
            chaptertitle = item.find('a').text()
            chapterurl = item.find('a').attr('href')
            if not chapterurl:
                # Empty cells in the chapter table have no link to follow.
                continue
            # Book metadata from list_books plus the chapter title.
            savedata = dict(response.save, chaptertitle=chaptertitle)
            self.crawl(chapterurl, callback=self.list_content, save=savedata, validate_cert=False)

    @config(priority=2)
    def list_content(self, response):
        # The second breadcrumb link is taken as the category name.
        navlist = [item.text() for item in response.doc('dt > a').items()]
        category = navlist[1] if len(navlist) > 1 else ''
        # Chapter body; stays empty if the #contents element is missing.
        content = ''
        for item in response.doc('#contents').items():
            content = item.text()
        return {
            "booktitle": response.save['booktitle'],
            "author": response.save['author'],
            "updatetime": response.save['updatetime'],
            "status": response.save['status'],
            "category": category,
            "chaptertitle": response.save['chaptertitle'],
            "content": content
        }

    def on_result(self, result):
        # Called with the return value of every callback. Callbacks that only
        # schedule further crawls return None and are skipped here.
        # Note: this override replaces the default handling. Also call
        # super(Handler, self).on_result(result) if results should still
        # be passed on to pyspider's own result pipeline.
        if not result or not result.get('booktitle'):
            return
        sql = SQL()
        sql.replace('novel', **result)

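The rest of the code is straightforward, so the part worth explaining is saving to the local database.

First, create a new module named mysqldb.py in the directory C:\Python3.5\Lib\site-packages\pyspider\database\mysql and enter: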

import pymysql
# import mysql.connector


class SQL:
    username = 'root'
    password = ''
    database = 'dingdian'
    host = 'localhost'
    connection = None
    charset = 'utf8'
    # Renamed from "connect" so the flag no longer clashes with the method.
    auto_connect = True
    placeholder = '%s'

    def __init__(self):
        if self.auto_connect:
            self.connect()

    def escape(self, string):
        # Quote a table or column name with backticks.
        return '`%s`' % string

    def connect(self):
        config = {'user': SQL.username, 'password': SQL.password,
                  'host': SQL.host, 'charset': SQL.charset}
        if SQL.database is not None:
            config['database'] = SQL.database
        try:
            # cnx = mysql.connector.connect(**config)
            SQL.connection = pymysql.connect(**config)
            return True
        except Exception as err:
            print('Something went wrong', err)
            return False

    def replace(self, tablename=None, **values):
        if SQL.connection is None:
            print('Please connect first')
            return False
        tablename = self.escape(tablename)
        if values:
            # One backtick-quoted column and one placeholder per key.
            _keys = ",".join(self.escape(k) for k in values)
            _values = ",".join([self.placeholder] * len(values))
            sql_query = "REPLACE INTO %s (%s) VALUES (%s)" % (tablename, _keys, _values)
        else:
            # MySQL has no DEFAULT VALUES clause; empty lists insert a row of defaults.
            sql_query = "REPLACE INTO %s () VALUES ()" % tablename
        cur = SQL.connection.cursor()
        try:
            if values:
                cur.execute(sql_query, list(values.values()))
            else:
                cur.execute(sql_query)
            SQL.connection.commit()
            return True
        except Exception as err:
            print("An error occurred: {}".format(err))
            return False
        finally:
            cur.close()
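To see what SQL.replace actually sends to MySQL, the query-building part can be pulled out into a pure function that needs no database. This is a sketch for illustration only; build_replace_query is a hypothetical helper and not part of the module above.

```python
from collections import OrderedDict


def escape(name):
    """Quote a table or column name with backticks, as SQL.escape does."""
    return '`%s`' % name


def build_replace_query(tablename, values, placeholder='%s'):
    """Return (query, params) for a parameterized REPLACE INTO statement."""
    columns = ",".join(escape(k) for k in values)
    marks = ",".join([placeholder] * len(values))
    query = "REPLACE INTO %s (%s) VALUES (%s)" % (escape(tablename), columns, marks)
    return query, list(values.values())


# An ordered mapping keeps the column order predictable on older Pythons.
row = OrderedDict([('booktitle', 'Some Book'), ('author', 'Someone')])
query, params = build_replace_query('novel', row)
# query  -> REPLACE INTO `novel` (`booktitle`,`author`) VALUES (%s,%s)
# params -> ['Some Book', 'Someone']
```

The values are passed to the driver separately from the query, so the driver does the quoting. Only the table and column names are interpolated into the SQL string, which is safe here because they come from the handler's own dict keys. Keep in mind that REPLACE only overwrites a row when the new row collides with a primary or unique key. If the table's only key is an auto-generated id, each call simply inserts a new row.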

Next, use phpMyAdmin from WAMP to create a database named dingdian and a table named novel.

The table's fields are id, booktitle, chaptertitle, category, author, status, content, updatetime
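If you would rather create the table from code than through phpMyAdmin, something along these lines could work. The column names must match the keys of the dict returned by list_content (booktitle, chaptertitle and so on). The column types and lengths below are assumptions, since the article does not specify them, and create_novel_table is a hypothetical helper.

```python
# Column types and lengths are guesses; adjust them to your data.
NOVEL_DDL = """
CREATE TABLE IF NOT EXISTS `novel` (
    `id`           INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `booktitle`    VARCHAR(255) NOT NULL,
    `chaptertitle` VARCHAR(255) NOT NULL,
    `category`     VARCHAR(64),
    `author`       VARCHAR(128),
    `status`       VARCHAR(32),
    `content`      MEDIUMTEXT,
    `updatetime`   VARCHAR(32),
    PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8
"""


def create_novel_table(connection):
    """Run the DDL on a DB-API connection, e.g. one from pymysql.connect()."""
    cursor = connection.cursor()
    try:
        cursor.execute(NOVEL_DDL)
        connection.commit()
    finally:
        cursor.close()
```

updatetime is stored as text here because the crawler saves it exactly as scraped from the page.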

With that, the crawler for the Dingdian novel site is complete. The full code is at:

https://github.com/chenxiang2017/spidersamples/tree/master/dingdian/dingdianpyspider

Note: I connect to MySQL with pymysql here. If it is not installed, run pip install pymysql first.

Last edited: 2017.12.09 02:11:30
