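How to Write a Spider

This chapter uses http://quotes.toscrape.com/ as an example to show how to write a simple spider.

First, create a spider from the project directory with the command scrapy genspider quotes quotes.toscrape.com. This creates a file named quotes.py in the spiders directory with the following content: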

# -*- coding: utf-8 -*-
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass

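Scrapy has defined a class for us that inherits from scrapy.Spider, along with a few attributes and a method:

name: identifies the spider; two spiders in the same project cannot share a name;
allowed_domains: an optional list of domains the spider is allowed to crawl; if it is omitted, Scrapy does not restrict the domains;
start_urls: the list of URLs the crawl starts from; subsequent requests are generated from the data in the pages downloaded from these URLs. As shown later, defining a start_requests method overrides this behavior;
parse method: Scrapy's default callback. After downloading a page, Scrapy builds a Response object and calls parse with it as the argument. This method holds the logic for parsing and processing the page and returns the extracted data (as items or dicts).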

Setting the Starting Point of the Crawl

Besides the start_urls attribute shown above, the starting point can also be set by defining a start_requests method, as follows:

# -*- coding: utf-8 -*-
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
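    # start_urls = ['http://quotes.toscrape.com/']  # replaced by start_requests below

    def start_requests(self):
        # Build the initial request by hand instead of relying on start_urls.
        url = "http://quotes.toscrape.com/"
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # response.text is the decoded page; response.body would print a bytes literal.
        print(response.text)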

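Running scrapy crawl quotes --nolog > quotes.html produces a local HTML file containing the content of http://quotes.toscrape.com/.
In fact, the Scrapy source shows that the parent class scrapy.Spider implements start_requests with the following code: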

for url in self.start_urls:
    yield Request(url, dont_filter=True)

So, if you want to customize how the crawl starts, implement your own start_requests method; otherwise, simply list the starting URLs in start_urls.
Request and Response represent the HTTP request and response messages, respectively. They are described briefly below:

Request

scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])

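url: the request URL; required;
callback: the callback that parses the page; defaults to the spider's parse method;
method: the HTTP method; defaults to GET;
headers: a dict of request headers; if a header's value is None, that header is not sent;
body: the request body;
cookies: the cookies, usually a dict, or alternatively a list of dicts;
meta: an important one. This dict is commonly used to pass information between objects; for example, a Response exposes the meta of the Request that produced it;
priority: the priority of the request; defaults to 0, and requests with higher values are scheduled earlier;
dont_filter: whether to bypass the duplicate filter. With the default False, a request for a URL that was already requested is dropped; with True, it is sent again;
errback: the callback invoked when the request raises an exception or ends in an HTTP error.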

Scrapy provides several subclasses of Request, such as FormRequest for handling HTML forms.

Response

scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])

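url: the URL that was requested;
status: the HTTP response status code;
headers: the response headers;
body: the response body, of type bytes;
request: the corresponding Request object.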

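Beyond the constructor arguments, the following attributes and methods of Response are worth knowing:
meta: comes from Request.meta (a shortcut for Response.request.meta);
css(query): extracts data from the response with a CSS selector; a shortcut for TextResponse.selector.css(query);
xpath(query): extracts data from the response with an XPath selector; a shortcut for TextResponse.selector.xpath(query);
urljoin(url): converts a relative URL into an absolute one, using the response URL as the base.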

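Extracting Data with Selector

Now that we know how a page is fetched, the next step is extracting the information we need from it. Scrapy offers two ways to do this: CSS selectors and XPath selectors (see other references for the full syntax; only common usage is listed here).
Let's look at Selector first: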

scrapy.selector.Selector(response=None, text=None, type=None)

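response: builds the Selector from a Response object;
text: builds the Selector from a string, for cases where no Response is available; pass either response or text, not both;
type: the parser type, html or xml; usually left unset so that Scrapy picks it automatically.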

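Selector also provides the following commonly used methods:
xpath(query): extracts data with an XPath selector and returns a SelectorList;
css(query): extracts data with a CSS selector and returns a SelectorList;
extract(): returns the selected element serialized as a single Unicode string;
re(regex): returns a list of the Unicode strings in the selected element that match the regular expression regex.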

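Now the methods provided by SelectorList:
xpath(query): applies the XPath query to every element in the list, collects all results into one SelectorList, and returns it;
css(query): applies the CSS query to every element in the list, collects all results into one SelectorList, and returns it;
extract(): calls extract() on every element and returns the results as a list of Unicode strings;
re(regex): calls re(regex) on every element and returns all matches as a flat list of Unicode strings;
extract_first(): returns the Unicode string of the first element, or None if the list is empty;
re_first(regex): returns the first Unicode string that matches regex, or None if nothing matches.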

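Below is a brief look at CSS selectors and XPath selectors.

CSS Selectors

For general CSS selector syntax, see the W3C reference.
In addition, Scrapy supports two non-standard pseudo-elements:

Selector	Description	Example
element::text	Selects the text of element	p::text
element::attr(attr_name)	Selects the value of the attr_name attribute of element	a::attr(href)

Example: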

>>> import scrapy
>>> html_body = """
... <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
...         <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
...         <span>by <small class="author" itemprop="author">Albert Einstein</small>
...         <a href="/author/Albert-Einstein">(about)</a>
...         </span>
...         <div class="tags">
...             Tags:
...             <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">
...
...             <a class="tag" href="/tag/change/page/1/">change</a>
...
...             <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
...
...             <a class="tag" href="/tag/thinking/page/1/">thinking</a>
...
...             <a class="tag" href="/tag/world/page/1/">world</a>
...
...         </div>
...     </div>
...     """
>>> selector = scrapy.Selector(text = html_body)
>>> type(selector)
<class 'scrapy.selector.unified.Selector'>
>>> selector_list = selector.css('div.quote div.tags a.tag::text')  # we only want the text, so a.tag::text alone also works
>>> type(selector_list)
<class 'scrapy.selector.unified.SelectorList'>
>>> selector_list.extract_first()           # extract the first tag
'change'
>>> selector_list.extract()                 # extract all tags
['change', 'deep-thoughts', 'thinking', 'world']
>>> selector_list.re_first(r'(^w\w+)')      # extract with a regular expression
'world'

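XPath Selectors

For general XPath syntax, see the W3C reference.
One expression is used especially often here:

Selector	Description
text()	Selects the text nodes

>>> # reuse html_body and selector from the CSS example above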
>>> selector_list = selector.xpath('//div[@class="quote"]/div[@class="tags"]/a[@class="tag"]/text()')

>>> selector_list.extract()
['change', 'deep-thoughts', 'thinking', 'world']
>>> selector_list.re_first(r'(^w\w+)')
'world'
>>> selector_list.extract_first()
'change'

Crawling the Site

By combining Response, Request, and Selector, we can write a simple crawler. The example below crawls http://quotes.toscrape.com/ and returns the quote, author, and tags of each entry.

#-*- coding: utf-8 -*-
#quotes.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    # def start_requests(self):
    #     url = "http://quotes.toscrape.com/"
    #     yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        quote_selector_list = response.css('body > div > div:nth-child(2) > div.col-md-8 div.quote')

        for quote_selector in quote_selector_list:
            quote = quote_selector.css('span.text::text').extract_first()
            author = quote_selector.css('span small.author::text').extract_first()
            tags = quote_selector.css('div.tags a.tag::text').extract()

            yield {'quote': quote, 'author': author, 'tags': tags}

        # Follow the link to the next page, if there is one.
        next_page_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(next_page_url, callback=self.parse)

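http://quotes.toscrape.com/ has robots.txt rules enabled, so in the project's settings file, settings.py, change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False.
Run the command scrapy crawl quotes --nolog -o result.json; when it finishes, the generated file result.json appears in the directory.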

[Image: crawl results]

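As the code above shows, after yielding the data we need, parse looks for the link to the next page. If one is found, it yields another Request whose callback is again parse(), so the crawl keeps iterating until the site has no next page.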

Summary

This post described how a spider file runs and how it extracts data. The next post will cover how to use scrapy.Item to wrap the scraped data.