Scrapy - My First Spider and Crawling My Own Blog

My first spider

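For my first Scrapy spider I use the first example from the official documentation: crawling http://quotes.toscrape.com. I could not find a Chinese version of the Scrapy 1.5 documentation, so parts of what follows are my own translation of the official docs (plug: if you need something translated, feel free to contact me; I have translated three English books for publication, two of them on my own, LOL). The steps are: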

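In CMD, change to the directory where you want to keep your code and run: scrapy startproject myspiders, where myspiders can be any project name you like.
Scrapy automatically creates a directory named myspiders and initializes some content inside it.
Go into the myspiders/myspiders/spiders directory (the spiders folder inside the project's Python package), create a file named quotestoscrape.py, and add the following code: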

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'quotestoscrape'

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)


    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

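After saving, switch back to CMD and, from inside the project directory, run scrapy crawl quotestoscrape. Before showing the results, let me briefly explain this code: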

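First, my own testing shows that the start_requests(self) method is not mandatory: a class attribute list named start_urls works as well. Still, I think it is better to stick to one standard way of writing it. If the method is present, the documentation says it must return an iterable of Request objects (either a list of requests or a generator function). It tells the spider which address or addresses to start crawling from, and later requests are generated successively from these initial ones.
Every request downloads some content from the server, and the parse() method is what processes that content. The response parameter holds the content of the whole page, which you can then process further with other methods.
The yield keyword represents another Python feature: generators. It just occurred to me that I have never written about them, although I know what they are. I will write about them some other time.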
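Since yield appears in both methods of the spider, here is a minimal sketch of a generator in plain Python, with no Scrapy involved. The function name fake_parse and the sample data are made up for illustration; the only point is that a function containing yield produces its values lazily, one at a time, which is how parse() hands items back to Scrapy.

```python
def fake_parse(quotes):
    """Hypothetical stand-in for parse(): yields one dict per quote."""
    for text, author in quotes:
        # Execution pauses here until the caller asks for the next item.
        yield {'text': text, 'author': author}


# Made-up sample data, not taken from the real site.
sample = [('quote one', 'Author A'), ('quote two', 'Author B')]

gen = fake_parse(sample)   # nothing has run yet; gen is a generator object
first = next(gen)          # runs the body up to the first yield
rest = list(gen)           # drains the remaining items

print(first)
print(rest)
```

Calling fake_parse() does not execute the loop; the body only runs as the caller iterates, so a spider can emit items and further requests without building a complete list first.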

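Running the command prints a large amount of log output, most of which is easy to understand. I only show the part we care about here: the first half is the scraped results, and the last part is a statistics summary: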

....
2018-04-19 15:56:07 [scrapy.core.engine] INFO: Spider opened
2018-04-19 15:56:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-19 15:56:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-19 15:56:07 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2018-04-19 15:56:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin'}
2018-04-19 15:56:07 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-19 15:56:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 446,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2701,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 4, 19, 19, 56, 7, 908603),
 'item_scraped_count': 10,
 'log_count/DEBUG': 13,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 4, 19, 19, 56, 7, 400951)}
2018-04-19 15:56:07 [scrapy.core.engine] INFO: Spider closed (finished)

If I find the time, I will write a separate post that goes through this log in detail.

error: No module named win32api

When you finally run the spider, you may get an error saying that win32api cannot be found. Installing the following module fixes it: pip install pypiwin32.

Processing the response further

If you are new to crawlers, response.css(), quote.css(), quote.xpath() and extract_first() in the code above may look unfamiliar. These are the methods for processing the response further.

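This part requires some HTML/CSS knowledge: you need to know which expression extracts the content you want from what was returned. Because a web page's markup is a tree, a suitable expression can in principle reach any content we want. There are usually two ways to work out the expression: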

The first is to use the browser's inspect mode.
The second is to use the interactive shell that Scrapy provides.

CSS selectors

In the code above, response.css('div.quote') and quote.css('span.text::text') are both CSS selectors. Opening the element inspector for the page gives the following:

[Figure: Python爬虫CSS选择器.jpg, the page in the browser's element inspector]

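As I see it, the process goes roughly like this: use the tags listed at the bottom of the screen to find an approximate location first; for example, quote = response.css('div.quote') lands on the blue box in the figure. Then we narrow the selection down further. I did not find documentation explaining how to do this, so here is my own rule of thumb: separate HTML tags with a space (a descendant selector); if a tag carries a class, attach the class with a dot; finally append ::text, which selects the element's text nodes instead of the element itself, so you get the content without the surrounding <...> tags.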

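Throughout this process we can test with the Scrapy shell. In CMD, enter: scrapy shell "http://quotes.toscrape.com/". This prints a pile of log output followed by a list of available objects and shortcuts: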

D:\OneDrive\Documents\Python和数据挖掘\code\blogspider>scrapy shell "http://quotes.toscrape.com/"
.............省略.............
2018-04-19 18:28:19 [scrapy.core.engine] INFO: Spider opened
2018-04-19 18:28:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2018-04-19 18:28:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000029D0C61AC50>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/>
[s]   response   <200 http://quotes.toscrape.com/>
[s]   settings   <scrapy.settings.Settings object at 0x0000029D0ED439B0>
[s]   spider     <DefaultSpider 'default' at 0x29d0efecc18>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>

The object we mostly use is response. From here we can experiment, as follows:

# Locate the site's title; extract() returns the data inside the selector
>>> response.css('title::text')
[<Selector xpath='descendant-or-self::title/text()' data='Quotes to Scrape'>]
>>> response.css('title::text').extract()
['Quotes to Scrape']

# Locate the author names; this is the most explicit form
>>> response.css("div.quote span small.author::text").extract()
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
# It can also be written more simply
>>> response.css("div span small::text").extract()
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
# Or split into chained calls
>>> response.css("div.quote").css("span").css("small.author::text").extract()
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
# Only need the first item?
>>> response.css("div.quote").css("span").css("small.author::text")[0].extract()
'Albert Einstein'
>>> response.css("div.quote").css("span").css("small.author::text").extract_first()
'Albert Einstein'

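If you have written CSS for a website before, this is quite easy to follow, because the underlying logic is the same; a bit of experimenting in the shell is enough to get the hang of it. If you look closely, you will notice that css() actually returns a list (a SelectorList, until you call extract_first() or index into it), which is convenient when writing code.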

XPath selectors

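The other approach is XPath selectors, such as quote.xpath('span/small/text()') in the code above. According to the documentation, XPath is the foundation of Scrapy selectors; in fact, even CSS selectors are converted to XPath under the hood (the Selector printed in the shell session above shows this: its xpath attribute reads descendant-or-self::title/text()). What makes XPath more powerful than CSS selectors is that it can also work with the content of the page itself, for example selecting the link whose text is "Next". The official docs point to three resources on XPath: using XPath with Scrapy Selectors, learn XPath through examples, and how to think in XPath.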
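To make the two ideas above concrete (a relative path, and selecting a link by its text), here is a small sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy. ElementTree supports only a subset of XPath, the [.='text'] predicate needs Python 3.7 or later, and the HTML snippet is made up to resemble the quotes page, so treat this as an illustration of the expressions rather than of Scrapy's own selector API.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet that mimics the structure of the quotes page.
html = """
<div>
  <div class="quote">
    <span class="text">Sample quote</span>
    <span>by <small class="author">Sample Author</small></span>
  </div>
  <ul>
    <li><a href="/page/0/">Previous</a></li>
    <li><a href="/page/2/">Next</a></li>
  </ul>
</div>
"""

root = ET.fromstring(html)

# Relative path with the same shape as quote.xpath('span/small/text()').
quote = root.find(".//div[@class='quote']")
author = quote.find('span/small').text

# Select a link by its text content, which CSS selectors cannot express.
next_link = root.find(".//a[.='Next']")
next_href = next_link.get('href')

print(author)
print(next_href)
```

On the Scrapy side, an expression along the lines of response.xpath('//a[contains(text(), "Next")]/@href') would be the counterpart; that exact expression is an untested assumption about the real page's markup.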

Saving the data

This takes just one command. For example, to write the output of the spider above to a JSON file, I only need to run the following in CMD:

scrapy crawl quotestoscrape -o data.json

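-o presumably stands for output; it resembles Linux command-line options and is easy to understand. Other file formats work too. The official docs recommend a format called JSON Lines, although I don't yet know what that format is.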

All supported export formats are: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'.
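For reference, JSON Lines (the 'jsonlines' / 'jl' entries in the list above) stores one complete JSON object per line instead of one big list. The sketch below shows the idea with the standard json module only; the records are made up, and an in-memory buffer stands in for a file such as data.jl.

```python
import io
import json

# Made-up records shaped like the items the spider yields.
records = [
    {'text': 'quote one', 'author': 'Author A'},
    {'text': 'quote two', 'author': 'Author B'},
]

# Write: one JSON object per line, no enclosing list and no separating commas.
buffer = io.StringIO()   # stands in for a file such as data.jl
for record in records:
    buffer.write(json.dumps(record) + '\n')

# Read: each line parses on its own, so the file can be processed as a stream
# and new records can be appended without rewriting what is already there.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]

print(buffer.getvalue())
print(loaded == records)
```

Because every line is valid on its own, appending to an existing file keeps it readable, which is not true of a single JSON list.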

Crawling the next page

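A site like http://quotes.toscrape.com is split across several pages. We can parse the link of the "Next" button to crawl the next page, then the page after that, and so on until there is no next page. The code is easy to follow, so here it is: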

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'quotestoscrape'

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)


    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
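The response.urljoin(next_page) call is needed because the href of the "Next" link is usually relative. Scrapy resolves it against the URL of the current response; the same resolution can be tried out with the standard library's urllib.parse.urljoin. The hrefs below are assumed examples, not values copied from the site.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (assumed for illustration).
page_url = 'http://quotes.toscrape.com/page/1/'

# A root-relative href, like the one the "Next" link is assumed to carry.
root_relative = urljoin(page_url, '/page/2/')

# An href relative to the current path resolves to the same page here.
path_relative = urljoin(page_url, '../2/')

# An already absolute URL is returned unchanged.
absolute = urljoin(page_url, 'http://example.com/other')

print(root_relative)
print(path_relative)
print(absolute)
```

Without this step, a bare '/page/2/' is not a complete URL and cannot be requested on its own.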

Crawling my own blog

Enough talk, time for something practical: I want to scrape the titles and publication dates of all the posts on my own blog. The code is:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'ethanshub'
    start_urls = [
        'https://journal.ethanshub.com/archive',
    ]

    def parse(self, response):
        # Each ul.listing holds the posts of one year; iterate over all of them.
        for lists in response.css('ul.listing'):
            # The selector matches twice as many entries as there are titles,
            # so the dates sit at the even indices (0, 2, 4, ...).
            for j in range(len(lists.css("li.listing_item")) // 2):
                yield {
                    'date': lists.css("li.listing_item::text")[j*2].extract(),
                    'title': lists.css("li.listing_item a::text")[j].extract(),
                }

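The only thing to watch out for here is not to scrape just one year of posts: you have to find the smallest structure that contains all the posts. After that it is simple logic. One more point worth mentioning: my blog runs on Bitcron, the CSS is rendered on the backend, and I wrote my CSS following its syntax rules. During my analysis, however, I found that lists.css("li.listing_item") picks up one extra blank field for every item, so the number of dates extracted is always twice the number of titles. Fortunately this also guarantees that the number of dates is even, so a small adjustment to the code is enough.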
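The index arithmetic can be tried out on plain lists. The sketch below uses made-up values shaped like the selector results described above (every date followed by one whitespace-only entry) and compares two ways of pairing dates with titles; stripping the brackets is an optional cleanup that the spider itself does not do.

```python
# Made-up values: every date is followed by one whitespace-only entry.
raw_dates = ['[2017-12-16]\n', '\n', '[2017-12-15]\n', '\n']
titles = ['Post B', 'Post A']

# Variant 1: take every second entry, as the spider does with index j*2.
dates_by_index = raw_dates[::2]

# Variant 2: drop whitespace-only entries; this does not rely on their position.
dates_by_filter = [d for d in raw_dates if d.strip()]

# Pair each date with its title and tidy up the date string.
posts = [
    {'date': date.strip().strip('[]'), 'title': title}
    for date, title in zip(dates_by_filter, titles)
]

print(posts)
```

Filtering by content is a little more forgiving than fixed indices if the page ever stops emitting the blank entries.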

After running scrapy crawl ethanshub -o data.json, the resulting data.json looks like this:

[
{"date": "[2017-12-16]\n", "title": "Python3 \u722c\u866b\u5165\u95e8\uff08\u4e8c\uff09"},
{"date": "[2017-12-15]\n", "title": "Python3 \u722c\u866b\u5165\u95e8\uff08\u4e00\uff09"},
{"date": "[2017-12-13]\n", "title": "\u7528Python\u5411Kindle\u63a8\u9001\u7535\u5b50\u4e66"},
{"date": "[2017-12-12]\n", "title": "GUI\u7f16\u7a0b\uff0cTkinter\u5e93\u548c\u5e03\u5c40"},
{"date": "[2017-12-12]\n", "title": "Python3\u7684\u6b63\u5219\u8868\u8fbe\u5f0f"},
{"date": "[2017-12-10]\n", "title": "Python\u901f\u89c8[7]"},
{"date": "[2017-12-09]\n", "title": "Python\u901f\u89c8[6]"},
....
{"date": "[2013-09-16]\n", "title": "How to split a string in C"},
{"date": "[2012-11-28]\n", "title": "Common Filters for Wireshark"}
]

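Everything looks fine. Sequences such as \u722c are Unicode escapes for Chinese characters; this is only an encoding matter, so I won't go into it here.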
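For anyone who wants to see the actual characters, the escapes decode with the standard json module. The sketch below takes one line from the output above, decodes it, and serializes it again with ensure_ascii=False. On the Scrapy side, setting FEED_EXPORT_ENCODING = 'utf-8' in settings.py should make the exporter write the characters directly; that setting is mentioned here as a pointer and is not exercised in the sketch.

```python
import json

# One line copied from the data.json output above (raw string keeps the escapes).
line = r'{"date": "[2017-12-16]\n", "title": "Python3 \u722c\u866b\u5165\u95e8\uff08\u4e8c\uff09"}'

# json.loads turns the \uXXXX escapes back into the real characters.
record = json.loads(line)
print(record['title'])

# Serialize again without escaping non-ASCII characters.
readable = json.dumps(record, ensure_ascii=False)
print(readable)
```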
