Scrapy Crawler Learning Notes: Getting Started (Part 1)

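Today is June 26, 2016, and I am starting to learn web crawling.

The package I am using is Scrapy.

I have already installed anaconda3 in a Linux virtual machine and then installed Scrapy, version 1.1.

I am following the tutorial at https://doc.scrapy.org/en/1.1/intro/tutorial.html.

I have used crawlers before, but only very simple ones. Now I need to crawl weather, earthquake, and breaking-news data, so I am trying Scrapy to collect it.

First, create a project. I named it tianqi, the pinyin for "weather".

The command is: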

scrapy startproject tianqi

After the command runs, there is a tianqi directory under my home directory.

Enter this directory and list its contents:

[root@wangqi tianqi]# ls -l

total 8

-rw-rw-r-- 1 eyeglasses root  256 Jun 26 14:21 scrapy.cfg

drwxrwxr-x 4 eyeglasses root 4096 Jun 26 14:21 tianqi

There is one directory and one file.

The file is named scrapy.cfg, which is evidently a configuration file. Let's see what it contains.

------------------------------------------------------------------------------------------

# Automatically created by: scrapy startproject

#

# For more information about the [deploy] section see:

# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]

default = tianqi.settings

[deploy]

#url = http://localhost:6800/

project = tianqi

----------------------------------------------------------------------------------------------------------

There is not much in it: just the settings module and the project name.
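Since scrapy.cfg uses INI syntax, the two values in it can be read with Python's standard configparser module. The sketch below is only a standalone illustration: it parses the text from a string rather than from disk, and the names SCRAPY_CFG and read_scrapy_cfg are my own.

```python
import configparser

# The contents of scrapy.cfg as listed above (comment lines shortened).
SCRAPY_CFG = """\
# Automatically created by: scrapy startproject

[settings]
default = tianqi.settings

[deploy]
#url = http://localhost:6800/
project = tianqi
"""


def read_scrapy_cfg(text):
    """Return (settings module path, deploy project name) from cfg text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.get("settings", "default"), parser.get("deploy", "project")


settings_module, project = read_scrapy_cfg(SCRAPY_CFG)
print(settings_module)  # tianqi.settings
print(project)          # tianqi
```

The value tianqi.settings is a dotted module path, which points at the settings.py file inside the inner tianqi directory.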

Now look at the inner tianqi directory and see what it contains:

[root@wangqi tianqi]# ls -l

total 20

-rw-rw-r-- 1 eyeglasses root    0 Jul 14  2016 __init__.py

-rw-r--r-- 1 eyeglasses root  285 Jun 26 14:21 items.py

-rw-r--r-- 1 eyeglasses root  286 Jun 26 14:21 pipelines.py

drwxrwxr-x 2 eyeglasses root 4096 Jun 26 15:13 __pycache__

-rw-r--r-- 1 eyeglasses root 3128 Jun 26 14:21 settings.py

drwxrwxr-x 3 eyeglasses root 4096 Jun 26 15:49 spiders

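There are 4 files and 2 directories. The files are __init__.py, items.py, pipelines.py, and settings.py. They all end in .py, so they are Python source files. Let's look at their contents.

[root@wangqi tianqi]# vi __init__.py

Running the command above shows that the file is empty. In Python, a regular package directory contains an __init__.py file, which can define the package's attributes and functions. Even when the file is empty, its presence lets the directory be treated as a package and imported like a module.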
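To convince myself that an empty __init__.py is enough, here is a small sketch that builds a throwaway package in a temporary directory and imports a module from it. The package name demo_pkg and its settings.py are made up for the test; they are not part of the tianqi project.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_empty_package(package_name="demo_pkg"):
    """Create a package with an empty __init__.py and import a module from it."""
    with tempfile.TemporaryDirectory() as tmp:
        package_dir = Path(tmp) / package_name
        package_dir.mkdir()
        (package_dir / "__init__.py").write_text("")  # empty, like tianqi/__init__.py
        (package_dir / "settings.py").write_text("BOT_NAME = 'demo'\n")

        sys.path.insert(0, tmp)  # make the temporary directory importable
        try:
            importlib.invalidate_caches()
            module = importlib.import_module(package_name + ".settings")
            return module.BOT_NAME
        finally:
            sys.path.remove(tmp)
            # Forget the throwaway package so later imports start clean.
            sys.modules.pop(package_name + ".settings", None)
            sys.modules.pop(package_name, None)


print(import_empty_package())  # demo
```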

---------------------------------------------------------------------

[root@wangqi tianqi]# vi items.py  # Items are the containers that hold the scraped data

# -*- coding: utf-8 -*-

# Define here the models for your scraped items

#

# See documentation in:

# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class TianqiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

--------------------------------------------------------------------------------

Next is pipelines.py, which defines the item pipeline.

--------------------------------------------------------------

[root@wangqi tianqi]# vi pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here

#

# Don't forget to add your pipeline to the ITEM_PIPELINES setting

# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class TianqiPipeline(object):
    def process_item(self, item, spider):
        return item

--------------------------------------------------------------------
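The generated pipeline returns every item unchanged. As a rough idea of what a real one might do, here is a hypothetical pipeline that trims whitespace from string fields. The class name and the city and temperature fields are my own assumptions, and the item is a plain dict so the sketch runs without Scrapy.

```python
class StripWhitespacePipeline(object):
    """Hypothetical pipeline: trim whitespace from every string field."""

    def process_item(self, item, spider):
        for key, value in item.items():
            if isinstance(value, str):
                item[key] = value.strip()
        return item  # returning the item passes it on to the next pipeline


# Stand-alone check with a plain dict as the item; Scrapy is not needed here.
item = {"city": "  chengdu ", "temperature": 28}
cleaned = StripWhitespacePipeline().process_item(item, spider=None)
print(cleaned)  # {'city': 'chengdu', 'temperature': 28}
```

As the comment in pipelines.py says, a pipeline only takes effect once it is added to the ITEM_PIPELINES setting.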

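Now look at settings.py. It holds the project-wide configuration, including the ROBOTSTXT_OBEY setting for robots.txt rules. The robots protocol defines which parts of a site crawlers are allowed to visit.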

[root@wangqi tianqi]# vi settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for tianqi project

#

# For simplicity, this file contains only settings considered important or

# commonly used. You can find more settings consulting the documentation:

#

#    http://doc.scrapy.org/en/latest/topics/settings.html

#    http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html

#    http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'tianqi'

SPIDER_MODULES = ['tianqi.spiders']

NEWSPIDER_MODULE = 'tianqi.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

#USER_AGENT = 'tianqi (+http://www.yourdomain.com)'

# Obey robots.txt rules

ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)

#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)

# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay

# See also autothrottle settings and docs

#DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:

#CONCURRENT_REQUESTS_PER_DOMAIN = 16

#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)

#TELNETCONSOLE_ENABLED = False

# Override the default request headers:

#DEFAULT_REQUEST_HEADERS = {

#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

#  'Accept-Language': 'en',

#}

# Enable or disable spider middlewares

# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
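To get a feel for what obeying robots.txt means, the sketch below checks two URLs against a small rule set using Python's standard urllib.robotparser. This is not Scrapy's own middleware; the rules and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real site publishes its own robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(EXAMPLE_ROBOTS_TXT, "tianqi", "http://example.com/chengdu/index.shtml"))  # True
print(is_allowed(EXAMPLE_ROBOTS_TXT, "tianqi", "http://example.com/private/data.html"))    # False
```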

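Now let's look at the directories. There are two; start with __pycache__.

When a Python module is imported for the first time, the interpreter compiles the *.py file to bytecode and saves the result in the __pycache__ directory.

On later runs, if the interpreter finds that the *.py file has not changed, it skips the compile step and loads the *.pyc file saved in __pycache__ directly.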
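The location of the cached file can be checked with the standard py_compile and importlib.util modules. This is a small sketch under my own assumptions: the temporary settings.py is a stand-in, and compile_and_locate is a helper name I made up.

```python
import importlib.util
import py_compile
import tempfile
from pathlib import Path


def compile_and_locate(source_path):
    """Compile a .py file to bytecode and return the path of the .pyc file."""
    return Path(py_compile.compile(str(source_path), doraise=True))


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "settings.py"
    source.write_text("BOT_NAME = 'tianqi'\n")

    cached = compile_and_locate(source)
    print(cached.parent.name)  # __pycache__
    print(cached.name)         # e.g. settings.cpython-35.pyc, depending on the Python version
    # The interpreter computes the same location when it looks for a cached file.
    print(str(cached) == importlib.util.cache_from_source(str(source)))
```

The version tag in the file name (cpython-35 in my listing) is why several Python versions can share one __pycache__ directory.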

-----------------------------------------------------------------

[root@wangqi __pycache__]# ls -l

total 8

-rw-r--r-- 1 eyeglasses root 125 Jun 26 15:13 __init__.cpython-35.pyc

-rw-r--r-- 1 eyeglasses root 240 Jun 26 15:13 settings.cpython-35.pyc

---------------------------------------------------------------------------------

Disabling pycache

For a single run: add the -B option when running the script.

Permanently: set the environment variable PYTHONDONTWRITEBYTECODE=1.
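Both switches can be tried out from Python itself. The sketch below runs a tiny script in a child interpreter three times and reports whether a __pycache__ directory appears; helper.py, main.py, and the function name are made up for the experiment.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def import_creates_pycache(interpreter_args=(), extra_env=None):
    """Run a script that imports a helper module; report whether __pycache__ appears."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "helper.py").write_text("VALUE = 1\n")
        Path(tmp, "main.py").write_text("import helper\n")

        env = dict(os.environ)
        # Start from a neutral environment so only our own switches matter.
        env.pop("PYTHONDONTWRITEBYTECODE", None)
        env.pop("PYTHONPYCACHEPREFIX", None)
        env.update(extra_env or {})

        subprocess.run([sys.executable, *interpreter_args, "main.py"],
                       cwd=tmp, env=env, check=True)
        return Path(tmp, "__pycache__").exists()


print(import_creates_pycache())                                            # True
print(import_creates_pycache(interpreter_args=["-B"]))                     # False
print(import_creates_pycache(extra_env={"PYTHONDONTWRITEBYTECODE": "1"}))  # False
```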

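The last directory is spiders, which holds the spider code. Each file is named after the site it crawls; I usually name them by domain so they are easy to remember.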

[root@wangqi spiders]# ls -l

total 12

-rw-rw-r-- 1 eyeglasses root  161 Jul 14  2016 __init__.py

drwxrwxr-x 2 eyeglasses root 4096 Jun 26 15:21 __pycache__

-rw-r--r-- 1 eyeglasses root  589 Jun 26 15:10 weather_spider.py

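There are three entries here: two files and one directory. I created weather_spider.py myself to crawl a weather site, and it contains the crawling code. __init__.py and __pycache__ were explained above.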

---------------------------------------------------------------------------

[root@wangqi spiders]# vi weather_spider.py

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import scrapy


class WeatherSpider(scrapy.Spider):
    # The name used on the command line: scrapy crawl weather
    name = "weather"

    def start_requests(self):
        urls = [
            'http://sc.weather.com.cn/chengdu/index.shtml',
            'http://sc.weather.com.cn/neijiang/index.shtml',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The second-to-last URL segment is the city name, e.g. "chengdu".
        page = response.url.split("/")[-2]
        filename = 'weather-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

-----------------------------------------------------------------------------------------------
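The only non-obvious line in parse() is the one that derives the file name from the URL. The sketch below repeats that step outside Scrapy for the two start URLs; page_filename is a helper name I made up, and the logic is copied from the spider above.

```python
def page_filename(url, prefix="weather"):
    """Build the output filename the way parse() does for a given URL."""
    page = url.split("/")[-2]  # second-to-last path segment
    return "%s-%s.html" % (prefix, page)


for url in [
    "http://sc.weather.com.cn/chengdu/index.shtml",
    "http://sc.weather.com.cn/neijiang/index.shtml",
]:
    print(url.split("/"))
    print(page_filename(url))
# ['http:', '', 'sc.weather.com.cn', 'chengdu', 'index.shtml']
# weather-chengdu.html
# ['http:', '', 'sc.weather.com.cn', 'neijiang', 'index.shtml']
# weather-neijiang.html
```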

To start the crawl, enter the command:

[eyeglasses@wangqi spiders]$ scrapy crawl weather

2017-06-27 10:18:31 [scrapy] INFO: Scrapy 1.1.1 started (bot: tianqi)

2017-06-27 10:18:31 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tianqi.spiders', 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tianqi', 'SPIDER_MODULES': ['tianqi.spiders']}

2017-06-27 10:18:31 [scrapy] INFO: Enabled extensions:['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']

2017-06-27 10:18:31 [scrapy] INFO: Enabled downloader middlewares:['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']

2017-06-27 10:18:31 [scrapy] INFO: Enabled spider middlewares:['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']

2017-06-27 10:18:31 [scrapy] INFO: Enabled item pipelines:[]

2017-06-27 10:18:31 [scrapy] INFO: Spider opened

2017-06-27 10:18:31 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2017-06-27 10:18:31 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023

2017-06-27 10:18:31 [scrapy] DEBUG: Redirecting (302) tofrom

2017-06-27 10:18:31 [scrapy] DEBUG: Crawled (200)(referer: None)

2017-06-27 10:18:32 [scrapy] DEBUG: Crawled (200)(referer: None)

2017-06-27 10:18:32 [weather] DEBUG: Saved file quotes-neijiang.html

2017-06-27 10:18:32 [scrapy] DEBUG: Crawled (200)(referer: None)

2017-06-27 10:18:32 [weather] DEBUG: Saved file quotes-chengdu.html

2017-06-27 10:18:32 [scrapy] INFO: Closing spider (finished)

2017-06-27 10:18:32 [scrapy] INFO: Dumping Scrapy stats:

{'downloader/request_bytes': 969,

'downloader/request_count': 4,

'downloader/request_method_count/GET': 4,

'downloader/response_bytes': 34374,

'downloader/response_count': 4,

'downloader/response_status_count/200': 3,

'downloader/response_status_count/302': 1,

'finish_reason': 'finished',

'finish_time': datetime.datetime(2017, 6, 27, 2, 18, 32, 553229),

'log_count/DEBUG': 7,

'log_count/INFO': 7,

'response_received_count': 3,

'scheduler/dequeued': 2,

'scheduler/dequeued/memory': 2,

'scheduler/enqueued': 2,

'scheduler/enqueued/memory': 2,

'start_time': datetime.datetime(2017, 6, 27, 2, 18, 31, 471145)}

2017-06-27 10:18:32 [scrapy] INFO: Spider closed (finished)
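The two timestamps in the stats dump give the crawl duration. The sketch below copies them from the output above; the reading that the stats are in UTC while the log lines use local time is my assumption, though the 8-hour gap fits it.

```python
import datetime

# Values copied from the stats dump above.
start_time = datetime.datetime(2017, 6, 27, 2, 18, 31, 471145)
finish_time = datetime.datetime(2017, 6, 27, 2, 18, 32, 553229)


def elapsed_seconds(start, finish):
    """Return the crawl duration in seconds."""
    return (finish - start).total_seconds()


print(elapsed_seconds(start_time, finish_time))  # 1.082084

# The log lines show 10:18 while the stats show 02:18. Assuming the stats are
# in UTC and the log uses local time, the machine's clock is 8 hours ahead.
log_time = datetime.datetime(2017, 6, 27, 10, 18, 32)
offset = log_time - finish_time.replace(microsecond=0)
print(offset)  # 8:00:00
```

So the whole run, four requests including the redirect, took a little over one second.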

Checking the directory shows that the HTML files were generated. Success.
