2019-07-05 Setting Up the Scrapy Crawler Framework

Install the scrapy package:
pip install scrapy
The installation may report an error... on Python 3 you need to download the Twisted dependency manually.


Download it from: https://pypi.org/simple/twisted/
or from: https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

After downloading, place the wheel on the desktop: Twisted-19.2.1-cp37-cp37m-win_amd64.whl

pip install C:\Users\Administrator\Desktop\Twisted-19.2.1-cp37-cp37m-win_amd64.whl
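The wheel file name must match your interpreter: cp37 means CPython 3.7 and win_amd64 means 64-bit Windows. If unsure, you can check your version first:

python -c "import sys; print(sys.version)"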

Run pip install scrapy again; output like the following means every dependency is already in place:

D:\>pip install scrapy
Looking in indexes: https://mirrors.aliyun.com/pypi/simple/
Requirement already satisfied: scrapy in d:\python\lib\site-packages (1.6.0)
Requirement already satisfied: Twisted>=13.1.0 in d:\python\lib\site-packages (from scrapy) (19.2.1)
Requirement already satisfied: parsel>=1.5 in d:\python\lib\site-packages (from scrapy) (1.5.1)
Requirement already satisfied: PyDispatcher>=2.0.5 in d:\python\lib\site-packages (from scrapy) (2.0.5)
Requirement already satisfied: w3lib>=1.17.0 in d:\python\lib\site-packages (from scrapy) (1.20.0)
Requirement already satisfied: queuelib in d:\python\lib\site-packages (from scrapy) (1.5.0)
Requirement already satisfied: cssselect>=0.9 in d:\python\lib\site-packages (from scrapy) (1.0.3)
Requirement already satisfied: pyOpenSSL in d:\python\lib\site-packages (from scrapy) (19.0.0)
Requirement already satisfied: lxml in d:\python\lib\site-packages (from scrapy) (4.3.4)
Requirement already satisfied: service-identity in d:\python\lib\site-packages (from scrapy) (18.1.0)
Requirement already satisfied: six>=1.5.2 in d:\python\lib\site-packages (from scrapy) (1.12.0)
Requirement already satisfied: hyperlink>=17.1.1 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (19.0.0)
Requirement already satisfied: zope.interface>=4.4.2 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (4.6.0)
Requirement already satisfied: attrs>=17.4.0 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (19.1.0)
Requirement already satisfied: PyHamcrest>=1.9.0 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (1.9.0)
Requirement already satisfied: constantly>=15.1 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (15.1.0)
Requirement already satisfied: incremental>=16.10.1 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (17.5.0)
Requirement already satisfied: Automat>=0.3.0 in d:\python\lib\site-packages (from Twisted>=13.1.0->scrapy) (0.7.0)
Requirement already satisfied: cryptography>=2.3 in d:\python\lib\site-packages (from pyOpenSSL->scrapy) (2.7)
Requirement already satisfied: pyasn1-modules in d:\python\lib\site-packages (from service-identity->scrapy) (0.2.5)
Requirement already satisfied: pyasn1 in d:\python\lib\site-packages (from service-identity->scrapy) (0.4.5)
Requirement already satisfied: idna>=2.5 in d:\python\lib\site-packages (from hyperlink>=17.1.1->Twisted>=13.1.0->scrapy) (2.8)
Requirement already satisfied: setuptools in d:\python\lib\site-packages (from zope.interface>=4.4.2->Twisted>=13.1.0->scrapy) (40.8.0)
Requirement already satisfied: asn1crypto>=0.21.0 in d:\python\lib\site-packages (from cryptography>=2.3->pyOpenSSL->scrapy) (0.24.0)
Requirement already satisfied: cffi!=1.11.3,>=1.8 in d:\python\lib\site-packages (from cryptography>=2.3->pyOpenSSL->scrapy) (1.12.3)
Requirement already satisfied: pycparser in d:\python\lib\site-packages (from cffi!=1.11.3,>=1.8->cryptography>=2.3->pyOpenSSL->scrapy) (2.19)

If the output looks like the above, Scrapy and its dependencies are installed; you can double-check with scrapy version.

A Scrapy project can be created in any directory on any drive.
Switch to, say, drive D and run: scrapy startproject Tencent

This tells Scrapy to create a project named Tencent; a Tencent folder is generated under D:\.

The project's configuration files live in the inner Tencent folder, i.e. D:\Tencent\Tencent.
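For reference, startproject generates the standard Scrapy scaffold:

Tencent/
    scrapy.cfg            # deploy configuration file
    Tencent/              # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # our spiders go here
            __init__.py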

This directory contains the framework's main files:
  1. items.py: define the items to scrape here, i.e. your target fields such as position name and work location (you fill this in yourself); see the sketch below.
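The original screenshot is omitted; a minimal items.py consistent with the fields the spider below extracts would look like this (a sketch, not necessarily the author's exact file):

# -*- coding: utf-8 -*-
import scrapy


class TencentItem(scrapy.Item):
    # target fields for one job posting (names match the spider below)
    positionName = scrapy.Field()   # job title
    positionLink = scrapy.Field()   # link to the posting
    positionType = scrapy.Field()   # responsibilities text
    worklocation = scrapy.Field()   # work location
    publishTime = scrapy.Field()    # last update time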
  2. middlewares.py: the spider/downloader middlewares; rarely touched. The generated defaults are already complete and normally need no changes (nothing to configure):
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class TencentSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class TencentDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
  3. pipelines.py: the item pipeline; edit it to control how and in what format items are stored (you configure this yourself):
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class TencentPipeline(object):
    def __init__(self):
        # open the output file once, when the pipeline is instantiated
        self.f = open("tencent.csv", "w", encoding='utf8')

    def process_item(self, item, spider):
        # write each item as one JSON line; ensure_ascii=False keeps Chinese readable
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content)
        return item

    def close_spider(self, spider):
        # close the file when the spider finishes
        self.f.close()
  4. settings.py: toggles the framework's options on or off; most are disabled (commented out) by default, and you enable what you need, for example the item pipeline, as shown below.
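In the generated settings.py the pipeline block is commented out by default; uncommenting it enables our pipeline (the number is the execution order, lower runs first):

ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
}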

Create the spider file: scrapy genspider tencent "tencent.com"

The spider we actually write is tencent.py inside the spiders directory.


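genspider produces roughly this skeleton (before any editing; allowed_domains comes from the "tencent.com" argument):

# -*- coding: utf-8 -*-
import scrapy


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    start_urls = ['http://tencent.com/']

    def parse(self, response):
        pass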

After editing, tencent.py looks like this:

# -*- coding: utf-8 -*-
import scrapy
import json
from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    # allowed_domains = ['tencent.com']
    baseurl = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
    offset = 1
    start_urls = [baseurl.format(offset)]

    def parse(self, response):
        # the API returns JSON; the postings live under Data -> Posts
        job_items = json.loads(response.body.decode())['Data']['Posts']

        for job_item in job_items:
            item = TencentItem()
            item['positionName'] = job_item["RecruitPostName"]
            item['positionLink'] = job_item["PostURL"] + job_item["PostId"]
            item['positionType'] = job_item["Responsibility"]
            item['worklocation'] = job_item["LocationName"]
            item['publishTime'] = job_item["LastUpdateTime"]
            yield item

        # page through the API by incrementing pageIndex, up to page 430
        if self.offset < 430:
            self.offset += 1
            url = self.baseurl.format(self.offset)
            yield scrapy.Request(url=url, callback=self.parse)
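For reference, judging from the keys parse() reads, the API response has roughly this shape (values abbreviated; an illustration, not a captured payload):

{
    "Data": {
        "Posts": [
            {
                "RecruitPostName": "...",
                "PostURL": "...",
                "PostId": "...",
                "Responsibility": "...",
                "LocationName": "...",
                "LastUpdateTime": "..."
            }
        ]
    }
}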

Run the spider: scrapy crawl tencent

D:\Tencent\Tencent\spiders>scrapy crawl tencent
2019-07-05 13:56:28 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: Tencent)
2019-07-05 13:56:28 [scrapy.utils.log] INFO: Versions: lxml 4.3.4.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 19.2.1, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c  28 May 2019), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
2019-07-05 13:56:28 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'Tencent', 'NEWSPIDER_MODULE': 'Tencent.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['Tencent.spiders']}
2019-07-05 13:56:28 [scrapy.extensions.telnet] INFO: Telnet Password: 1f4cb6e4d1fc4caa
2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-07-05 13:56:28 [scrapy.middleware] INFO: Enabled item pipelines:
['Tencent.pipelines.TencentPipeline']
2019-07-05 13:56:28 [scrapy.core.engine] INFO: Spider opened
2019-07-05 13:56:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-07-05 13:56:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-07-05 13:56:28 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://careers.tencent.com/404.html> from <GET https://careers.tencent.com/robots.txt>
2019-07-05 13:56:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://careers.tencent.com/404.html> (referer: None)
2019-07-05 13:56:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn> (referer: None)
2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn>
{'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=01147013579229106176',
 'positionName': '22989-Serverless前端架构师',
 'positionType': '负责腾讯 Serverless 平台战略目标规划、整体平台产品能力设计;\n'
                 '负责探索前端技术与 Serverless 的结合落地,包括不限于腾讯大前端架构建设,公共组件的设计, '
                 'Serverless 的前端应用场景落地;\n'
                 '负责分析 Serverless 客户复杂应用场景的具体实现(小程序,Node.js);\n'
                 '负责 Serverless 场景中 Node.js 以及微信小程序相关生态建设。',
 'publishTime': '2019年07月05日',
 'worklocation': '深圳'}
2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn>
{'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=01147013576054018048',
 'positionName': '22989-语音通信研发工程师(深圳)',
 'positionType': '负责腾讯云通信号码保护、企业总机、呼叫中心、融合通信产品开发;\n'
                 '负责融合通信PaaS平台的构建和优化;\n'
                 '负责通话质量分析和调优;',
 'publishTime': '2019年07月05日',
 'worklocation': '深圳'}
2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn>
{'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=11231766955960606721123176695596060672',
 'positionName': '18435-合规反洗钱岗',
 'positionType': '1、根据反洗钱法律法规及监管规定的要求,完善落实反洗钱工作,指导各业务部门、分支机构开展反洗钱工作,支 持反洗钱监管沟通及监管报告反馈工作;\n'
                 '2、制定与完善内部反洗钱配套制度与流程,推动公司反洗钱标准化及流程化建设;\n'
                 '3、熟悉监管部门各项反洗钱政策制度要求,能就日常产品业务及合同及时进行反洗钱合规评审;\n'
                 '4、开展对各业务部门、分支机构的反洗钱合规自查工作,跟进缺陷问题;\n'
                 '5、根据反洗钱法律法规及监管规定的更新情况,及时对各业务部门进行法规解读,并追踪落实;\n'
                 '6、重点项目的跟进及推动工作;\n'
                 '7、领导交办的其他工作。',
 'publishTime': '2019年07月05日',
 'worklocation': '深圳总部'}
2019-07-05 13:56:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1562249003305&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn>
{'positionLink': 'http://careers.tencent.com/jobdesc.html?postId=11231779032200683521123177903220068352',
 'positionName': '25927-游戏测试项目经理',
 'positionType': '负责项目计划和迭代计划的制定、跟进和总结回顾,推动产品需求、运营需求和技术需求的落地执行,排除障碍,确保交付时间和质量;\n'
                 '负责跟合作有关部门和团队对接,确保内部外部团队高效协同工作;\n'
                 '不断优化项目流程规范;,及时发现并跟踪解决项目问题,有效管理项目风险。',
 'publishTime': '2019年07月05日',
 'worklocation': '深圳总部'}

Space here is limited, so the rest of the output was only shown as screenshots (omitted).
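One way to sanity-check the run without screenshots is to count the lines the pipeline wrote to tencent.csv (one JSON line per item):

# quick check: how many items did the pipeline write?
with open("tencent.csv", encoding="utf8") as f:
    print(sum(1 for _ in f), "items written")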



Summary:
Steps for writing a Scrapy crawler:
scrapy startproject XXXX
scrapy genspider xxxx "xxx.com"
Edit items.py to declare the data you want to extract
Edit xxxx.py under spiders to handle requests and responses and extract the data (yield item)
Edit pipelines.py to process the items the spider returns, e.g. persist them locally
Edit settings.py to enable the pipeline, ITEM_PIPELINES = {...}, and any other settings
Run the spider, as recapped below
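Put together, the whole workflow from this post is:

scrapy startproject Tencent
cd Tencent
scrapy genspider tencent "tencent.com"
# edit items.py, spiders/tencent.py, pipelines.py, settings.py
scrapy crawl tencent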
