Web Scraping in Practice 2 -- Crawling Weibo

This article follows on from the previous introductory scraping post. That post already made good use of requests, BeautifulSoup, and similar libraries, along with the simpler, commonly used scraping patterns. The purpose of this article is to get fully acquainted with each component of the Scrapy framework, and to use Scrapy to crawl Weibo.

Before diving in, here is an overview of Scrapy's architecture.

Scrapy architecture
1. Engine: triggers events and is the core of the whole framework.
2. Scheduler: queues the requests sent over by the engine.
3. Downloader: takes the requests handed over by the scheduler and returns the page content (Response) to the spider.
4. Spider: the main body of your code; it holds the core logic and the parsing rules for the pages.
5. Item Pipeline: its main job is cleaning and storing the scraped data.
6. Downloader Middlewares: hooks sitting between the engine and the downloader, mainly processing the requests and responses passing between them.
7. Spider Middlewares: hooks sitting between the engine and the spider, mainly processing the spider's input (responses) and its output (results and new requests).

Environment
Windows 8, Python 3, PyCharm, the Scrapy framework, MongoDB, and the PyMongo library.
The later posts on distributed crawling and captchas will also use
Redis and PIL.

1. Creating the Scrapy project
  • Open PyCharm's Terminal and run:
    scrapy startproject weibo
    * Then name the spider to be created and define the site it will crawl:
    scrapy genspider weibospider weibo.cn
    The first argument becomes the spider's name, the second the base_url it will visit.
(screenshot)

At this point a new weibospider.py file appears under the spiders package; opening it looks like the screenshot below.


(PyCharm screenshot)
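For reference, after the two commands above the generated project layout looks roughly like this (standard Scrapy scaffolding; the spider file takes the name passed to genspider):

```text
weibo/
├── scrapy.cfg            # deploy configuration
└── weibo/
    ├── __init__.py
    ├── items.py          # Item definitions
    ├── middlewares.py    # downloader / spider middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/
        ├── __init__.py
        └── weibospider.py  # generated by genspider
```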
2. The Scrapy selector: Selector

The previous article used BeautifulSoup and PyQuery. Scrapy provides its own data extraction tool, Selector, which is built on lxml and supports both XPath and CSS selectors. The examples below, taken from the official documentation, demonstrate how the selectors are used.
URL: http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
Then, in PyCharm's Terminal, enter: scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
This drops you into a Python environment with the request already made; response.text outputs the HTML.

(demo screenshot)

Selector usage is summarized below:

  • result = response.selector.xpath('//a'). The selector part can be omitted: result = response.xpath('//a') is equivalent. This selects every a tag under the root node.
  • result.xpath('./img') selects the img tags under those a tags. The leading dot matters: ./img extracts relative to the a nodes, whereas //img extracts from the root node.
  • The results above are all Selector objects. How do you turn them into strings stored in a list? Use result.xpath('./img').extract(), after which the usual list operations apply.
  • How do you get the text inside a tag? result.xpath('//a/text()').extract(). Just add text().
  • How do you get an attribute value?
    result.xpath('//a/@href').extract()
  • That returns all the attribute values, but what if you want one specific element?
    result.xpath('//a[@href="image1.html"]').extract()
    Note that the inner quotes must be double quotes, otherwise it raises an error. Also, extract() is not recommended for retrieving a single element; use extract_first() instead:
    result.xpath('//a[@href="image1.html"]').extract_first()
    Want the text? Add text():
    result.xpath('//a[@href="image1.html"]/text()').extract_first()
  • Next, the CSS selectors in Selector.
    How do you select the a tags as above, with the results stored in a list?
    result.css('a').extract()
    How do you select the img children of the a tags? A simple space:
    result.css('a img').extract()
    How do you select the child of an a tag with a given attribute?
    result.css('a[href="image1.html"] img').extract_first()
    How do you select attributes and text content?
    result.css('a[href="image1.html"] img::attr(src)').extract_first()
    result.css('a[href="image1.html"]::text').extract_first()
    Besides the CSS and XPath selectors, Selector also provides regular-expression matching via re(); to select just the first match, use re_first().
    Note that result cannot be used with re() or re_first() directly, or an error is raised; they must be chained after css() or xpath().
3. Scrapy's main components
  • The Spider module

The spider is the most important and most fundamental module in the Scrapy framework: it holds the main code logic for crawling and parsing. Its processing loop runs as follows:

1. Initial Requests are built from the starting URLs, each with a callback function. When a Request succeeds, a Response is generated and passed as an argument to that callback.

2. Inside the callback you analyze the returned content. There are two kinds of result: extracted data returned as a dict or an Item object, which can be processed and saved directly; or a follow-up link parsed from the page, from which you can construct a new Request with a new callback.

3. If a dict or an Item is returned and a pipeline is configured, the pipeline can process and save it.

4. If a Request is returned, then once it succeeds its Response is passed to the new callback defined on it, where we can again use the Selector described above to parse the content and build Items.

The most commonly used spider class is scrapy.spiders.Spider, which provides the start_requests() method and has the following attributes and methods:

Attributes: name, the spider's name; allowed_domains, the domains allowed to be crawled; start_urls, the starting URLs.
Methods:
1. start_requests(): generates the initial requests; it must return an iterable object. By default it builds Requests from the URLs in start_urls, and those Requests use GET; to issue a POST request, use FormRequest instead.
2. parse(): called by default when a Response has no callback specified.
3. closed(): called when the spider is closed.

*The Downloader Middleware module
As the architecture diagram at the start of the article shows, when the Scheduler takes a Request from the queue and sends it to the Downloader for download, the Request passes through the Downloader Middleware; and when the Downloader finishes the download and returns the Response toward the spider, it passes through this module again.

This module has two main methods:

1. process_request()
Before a Request scheduled by the Scheduler reaches the Downloader, process_request() is called. In other words, between the Request being scheduled off the queue and the download executing, we can process it with this method, for example to set the user agent or cookies. When setting a fixed user agent, though, it is recommended to simply add it to settings.py:
USER_AGENT = 'XXXXXXX'

2. process_response()
After the Downloader executes a Request it produces a Response, which the Scrapy engine then sends on to the spider for parsing. Before it is sent, process_response() can be used to process it.

Here is a demo. The user agent in it is a single randomly chosen string; replace it to suit the site you are crawling, the method is the same. Add the following to middlewares.py:

import random


class randomuseragentmiddleware(object):

    def __init__(self):
        # the list can hold any number of user agent strings
        self.user_agents = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36']

    def process_request(self, request, spider):
        # pick one at random for every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)

Then in settings.py, find the commented-out DOWNLOADER_MIDDLEWARES, remove the comment, and change it to the following:

'weibo.middlewares.randomuseragentmiddleware': 543,
Here weibo is the project name given to startproject above, and the last part is the class name; adjust both to match your own project.

*The Spider Middleware module
After the Downloader generates a Response, the Response is sent to the spider; before it reaches the spider, it first passes through the Spider Middleware. Likewise, after the spider generates Items and Requests, those Items and Requests also pass through the Spider Middleware.

  • The Item Pipeline module

The Item Pipeline has four main functions:
1. Cleaning HTML data
2. Validating the scraped data and checking the scraped fields
3. Detecting and dropping duplicates
4. Saving the scraped results to a database

Now for the Weibo crawling example.

Weibo's anti-scraping is very aggressive, so before crawling we need to set up a cookies pool. Here we take the ready-made route: I strongly recommend Cui Qingcai's existing cookies pool.

URL: https://github.com/Python3WebSpider/CookiesPOOL

After downloading and unpacking it, usage is explained in the following article:
URL: https://blog.csdn.net/qq_38661599/article/details/80945233

With the preparation above done, let's analyze the page structure. First open https://m.weibo.cn and log in, then go to a user's page and open the developer tools. Watching the XHR tab, you will see Ajax requests (dynamic pages returning JSON) whose names start with getIndex; their results contain that user's information page by page, and the Preview tab shows them as JSON.
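The getIndex endpoints used in the code below are URL templates with named placeholders. Since {uid} appears more than once in the profile URL, it has to be filled with a keyword argument, as a quick check shows:

```python
# profile endpoint observed in the XHR requests; {uid} occurs three times
user_url = ('https://m.weibo.cn/api/container/getIndex'
            '?uid={uid}&type=uid&value={uid}&containerid=100505{uid}')

url = user_url.format(uid='1977459170')  # a keyword fills every occurrence
print(url)
```

A positional call like user_url.format('1977459170') would raise a KeyError, because the placeholders are named.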

Here is the code:

The first part is spider.py:

import scrapy
import json
from scrapy import Request,Spider
from weibo.items import *
class WeibospiderSpider(scrapy.Spider):
    name = 'weibospider'
    allowed_domains = ['m.weibo.cn']
    start_urls = ['http://weibo.cn/']
    user_url = 'https://m.weibo.cn/api/container/getIndex?uid={uid}&type=uid&value={uid}&containerid=100505{uid}'

    follow_url = 'https://m.weibo.cn/api/container/getIndex?containerid=231051_-_followers_-_{uid}&page={page}' #爬取用户关注的人

    fan_url = 'https://m.weibo.cn/api/container/getIndex?containerid=231051_-_fans_-_{uid}&page={page}'  #爬取用户的粉丝

    weibo_url = 'https://m.weibo.cn/api/container/getIndex?uid={uid}&type=uid&page={page}&containerid=107603{uid}'

    start_users = ['1977459170', '1742566624']  # list of starting uids; you can add more


    def start_requests(self):
        for uid in self.start_users:
            yield Request(self.user_url.format(uid=uid), callback=self.parse_user)

    def parse_user(self, response):  # this response has come back through the downloader middleware and downloader described above
        result = json.loads(response.text)
        if result.get('data').get('userInfo'):  # the user's profile is stored under this key
            user_info = result.get('data').get('userInfo')
            user_item = UserItem()  # instantiate the UserItem class defined in items.py
            field_map = {
                'id': 'id', 'name': 'screen_name', 'avatar': 'profile_image_url', 'cover': 'cover_image_phone',
                'gender': 'gender', 'description': 'description', 'fans_count': 'followers_count',
                'follows_count': 'follow_count', 'weibos_count': 'statuses_count', 'verified': 'verified',
                'verified_reason': 'verified_reason', 'verified_type': 'verified_type'
            }  # maps item fields to the JSON keys to extract
            for field,attr in field_map.items():
                user_item[field]=user_info.get(attr)
            yield user_item

            uid=user_info.get('id')
            yield Request(self.follow_url.format(uid=uid, page=1), callback=self.parse_follows,
                          meta={'page': 1, 'uid': uid})
            yield Request(self.fan_url.format(uid=uid, page=1), callback=self.parse_fans,
                          meta={'page': 1, 'uid': uid})
            yield Request(self.weibo_url.format(uid=uid, page=1), callback=self.parse_weibos,
                          meta={'page': 1, 'uid': uid})

    def parse_follows(self, response):  # the request above goes through the downloader middleware; the downloader returns this response
        result = json.loads(response.text)  # parse the JSON
        if result.get('ok') and result.get('data').get('cards') and len(result.get('data').get('cards')) and \
                result.get('data').get('cards')[-1].get(
                        'card_group'):
            # parse the followed users
            follows = result.get('data').get('cards')[-1].get('card_group')
            for follow in follows:
                if follow.get('user'):
                    uid = follow.get('user').get('id')
                    yield Request(self.user_url.format(uid=uid), callback=self.parse_user)

            uid = response.meta.get('uid')
        # the follows list
            user_relation_item = UserRelationItem()
            follows = [{'id':follow.get('user').get('id'), 'name': follow.get('user').get('screen_name')} for follow in follows]
            user_relation_item['id'] = uid
            user_relation_item['follows'] = follows
            user_relation_item['fans'] = []
            yield user_relation_item
        # next page of follows
            page = response.meta.get('page') + 1
            yield Request(self.follow_url.format(uid=uid, page=page),callback=self.parse_follows, meta={'page': page, 'uid': uid})

    def parse_fans(self, response):
        """
        Parse the user's fans
        :param response: the Response object
        """
        result = json.loads(response.text)
        if result.get('ok') and result.get('data').get('cards') and len(result.get('data').get('cards')) and \
                result.get('data').get('cards')[-1].get(
                        'card_group'):
            # parse the fan users
            fans = result.get('data').get('cards')[-1].get('card_group')
            for fan in fans:
                if fan.get('user'):
                    uid = fan.get('user').get('id')
                    yield Request(self.user_url.format(uid=uid), callback=self.parse_user)

            uid = response.meta.get('uid')
            # the fans list
            user_relation_item = UserRelationItem()
            fans = [{'id': fan.get('user').get('id'), 'name': fan.get('user').get('screen_name')} for fan in
                    fans]
            user_relation_item['id'] = uid
            user_relation_item['fans'] = fans
            user_relation_item['follows'] = []
            yield user_relation_item
            # next page of fans
            page = response.meta.get('page') + 1
            yield Request(self.fan_url.format(uid=uid, page=page),
                          callback=self.parse_fans, meta={'page': page, 'uid': uid})

    def parse_weibos(self, response):
        """
        Parse the user's weibo list
        :param response: the Response object
        """
        result = json.loads(response.text)
        if result.get('ok') and result.get('data').get('cards'):
            weibos = result.get('data').get('cards')
            for weibo in weibos:
                mblog = weibo.get('mblog')
                if mblog:
                    weibo_item = WeiboItem()
                    field_map = {
                        'id': 'id', 'attitudes_count': 'attitudes_count', 'comments_count': 'comments_count',
                        'reposts_count': 'reposts_count', 'picture': 'original_pic', 'pictures': 'pics',
                        'created_at': 'created_at', 'source': 'source', 'text': 'text', 'raw_text': 'raw_text',
                        'thumbnail': 'thumbnail_pic',
                    }
                    for field, attr in field_map.items():
                        weibo_item[field] = mblog.get(attr)
                    weibo_item['user'] = response.meta.get('uid')
                    yield weibo_item
            # next page of weibos
            uid = response.meta.get('uid')
            page = response.meta.get('page') + 1
            yield Request(self.weibo_url.format(uid=uid, page=page), callback=self.parse_weibos,
                          meta={'uid': uid, 'page': page})

middlewares.py holds the code that hooks into the cookies pool:

import json
import logging
from scrapy import signals
import requests


class ProxyMiddleware():
    def __init__(self, proxy_url):
        self.logger = logging.getLogger(__name__)
        self.proxy_url = proxy_url
    
    def get_random_proxy(self):
        try:
            response = requests.get(self.proxy_url)
            if response.status_code == 200:
                proxy = response.text
                return proxy
        except requests.ConnectionError:
            return False
    
    def process_request(self, request, spider):
        if request.meta.get('retry_times'):
            proxy = self.get_random_proxy()
            if proxy:
                uri = 'https://{proxy}'.format(proxy=proxy)
                self.logger.debug('Using proxy ' + proxy)
                request.meta['proxy'] = uri

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(
            proxy_url=settings.get('PROXY_URL')
        )


class CookiesMiddleware():
    def __init__(self, cookies_url):
        self.logger = logging.getLogger(__name__)
        self.cookies_url = cookies_url
    
    def get_random_cookies(self):
        try:
            response = requests.get(self.cookies_url)
            if response.status_code == 200:
                cookies = json.loads(response.text)
                return cookies
        except requests.ConnectionError:
            return False
    
    def process_request(self, request, spider):
        self.logger.debug('Fetching cookies')
        cookies = self.get_random_cookies()
        if cookies:
            request.cookies = cookies
            self.logger.debug('Using cookies ' + json.dumps(cookies))
    
    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(
            cookies_url=settings.get('COOKIES_URL')
        )

pipelines.py holds the time-cleaning code and the storage code that process the items yielded from spider.py:

import re, time

import logging
import pymongo

from weibo.items import *


class TimePipeline():
    def process_item(self, item, spider):
        if isinstance(item, UserItem) or isinstance(item, WeiboItem):
            now = time.strftime('%Y-%m-%d %H:%M', time.localtime())
            item['crawled_at'] = now
        return item


class WeiboPipeline():
    def parse_time(self, date):
        if re.match('刚刚', date):  # "just now"
            date = time.strftime('%Y-%m-%d %H:%M', time.localtime(time.time()))
        if re.match(r'\d+分钟前', date):  # "N minutes ago"
            minute = re.match(r'(\d+)', date).group(1)
            date = time.strftime('%Y-%m-%d %H:%M', time.localtime(time.time() - float(minute) * 60))
        if re.match(r'\d+小时前', date):  # "N hours ago"
            hour = re.match(r'(\d+)', date).group(1)
            date = time.strftime('%Y-%m-%d %H:%M', time.localtime(time.time() - float(hour) * 60 * 60))
        if re.match('昨天.*', date):  # "yesterday HH:MM"
            date = re.match('昨天(.*)', date).group(1).strip()
            # subtract a day from the timestamp, not from the struct_time
            date = time.strftime('%Y-%m-%d', time.localtime(time.time() - 24 * 60 * 60)) + ' ' + date
        if re.match(r'\d{2}-\d{2}', date):  # "MM-DD"
            date = time.strftime('%Y-', time.localtime()) + date + ' 00:00'
        return date
    
    def process_item(self, item, spider):
        if isinstance(item, WeiboItem):
            if item.get('created_at'):
                item['created_at'] = item['created_at'].strip()
                item['created_at'] = self.parse_time(item.get('created_at'))
            if item.get('pictures'):
                item['pictures'] = [pic.get('url') for pic in item.get('pictures')]
        return item


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
    
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )
    
    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        self.db[UserItem.collection].create_index([('id', pymongo.ASCENDING)])
        self.db[WeiboItem.collection].create_index([('id', pymongo.ASCENDING)])
    
    def close_spider(self, spider):
        self.client.close()
    
    def process_item(self, item, spider):
        if isinstance(item, UserItem) or isinstance(item, WeiboItem):
            self.db[item.collection].update({'id': item.get('id')}, {'$set': item}, True)
        if isinstance(item, UserRelationItem):
            self.db[item.collection].update(
                {'id': item.get('id')},
                {'$addToSet':
                    {
                        'follows': {'$each': item['follows']},
                        'fans': {'$each': item['fans']}
                    }
                }, True)
        return item

Finally, settings.py holds the project's configuration:

# -*- coding: utf-8 -*-

# Scrapy settings for weibo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'weibo'

SPIDER_MODULES = ['weibo.spiders']
NEWSPIDER_MODULE = 'weibo.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'weibo (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'application/json, text/plain, */*',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,ja;q=0.4,zh-TW;q=0.2,mt;q=0.2',
    'Connection': 'keep-alive',
    'Host': 'm.weibo.cn',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'weibo.middlewares.WeiboSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'weibo.middlewares.CookiesMiddleware': 554,
    'weibo.middlewares.ProxyMiddleware': 555,
}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'weibo.pipelines.TimePipeline': 300,
    'weibo.pipelines.WeiboPipeline': 301,
    'weibo.pipelines.MongoPipeline': 302,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


MONGO_URI = 'localhost'

MONGO_DATABASE = 'weibo'

COOKIES_URL = 'http://localhost:5000/weibo/random'

PROXY_URL = 'http://localhost:5555/random'

RETRY_HTTP_CODES = [401, 403, 408, 414, 500, 502, 503, 504]

The content above is meant to give an overview of the Scrapy framework, how each module is used and what it does. Next time we will use Scrapy for a complete, typical crawling example: scraping JD.com.
