A Simple Web Scraping Project (Alibaba Children's Clothing Supplier Information) + Data Post-Processing

I. Environment Setup

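Google Chrome
Download Google Chrome
Run sudo dpkg -i google-chrome-stable_current_amd64.deb
webdriver
1. Find your Chrome version. It is shown when you download the package, or you can enter chrome://version in Chrome's address bar.
2. Download the chromedriver release that matches that version number.
3. Unzip the archive.
  Move the extracted file to /usr/local/bin/chromedriver:
  sudo mv chromedriver /usr/local/bin/chromedriver
  Make it executable:
  sudo chmod u+x,o+x /usr/local/bin/chromedriver
4. Setup is complete once chromedriver --version prints the version number.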
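The version check in steps 1 and 2 can be sketched in Python. This is a minimal illustration, not part of the original setup: the helper names and the sample version strings are made up, and comparing only the major version number is a simplifying assumption.

```python
import re


def major_version(version_output):
    """Return the major version number found in a version string."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError("no version number found in: " + version_output)
    return int(match.group(1))


def versions_match(chrome_output, chromedriver_output):
    """True if both version strings share the same major version."""
    return major_version(chrome_output) == major_version(chromedriver_output)


# Illustrative strings in the style of `google-chrome --version`
# and `chromedriver --version`; not captured from a real machine.
chrome = "Google Chrome 79.0.3945.88"
driver = "ChromeDriver 79.0.3945.36 (3582db32b33893869b8c1339e8f4d9ed1816f143)"
print(versions_match(chrome, driver))
```

If the two major versions differ, chromedriver will usually refuse to start a session, so it is worth checking before running the scraper.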
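Python3
1. Linux ships with several Python versions. List them with ls /usr/bin/python*
  
  Green: executable files
  Blue: directories
  White: regular files such as text, configuration, and source files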
2. Switch the default version
  2.1. update-alternatives --list python
  
  2.2. update-alternatives: error: no alternatives for python
  If you see the error above, update-alternatives does not yet know about any Python alternatives. To fix this, register python2.7 and python3.5 in the alternatives list.
  
  2.3. sudo update-alternatives --install /usr/bin/python python /usr/bin/python2.7 1
  sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.5 2
  The --install option takes several arguments that are used to create the symbolic link. The last argument is the priority: if no alternative has been selected manually, the one with the highest priority is used. Here /usr/bin/python3.5 has priority 2, so update-alternatives automatically makes it the default Python version.
  
  2.4.
  
  2.5 To switch versions later, run sudo update-alternatives --config python
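The priority rule described above can be modeled in a few lines of Python. This is only a toy model of the selection logic, not how update-alternatives is implemented; the function name and the dictionary layout are assumptions made for illustration.

```python
def pick_default(alternatives, manual_choice=None):
    """Pick the active alternative.

    alternatives maps a path to its priority. A manual choice
    (as made with --config) wins; otherwise the highest priority wins.
    """
    if manual_choice is not None:
        if manual_choice not in alternatives:
            raise KeyError(manual_choice)
        return manual_choice
    return max(alternatives, key=alternatives.get)


# The two alternatives registered with --install in step 2.3.
registered = {"/usr/bin/python2.7": 1, "/usr/bin/python3.5": 2}

print(pick_default(registered))
print(pick_default(registered, manual_choice="/usr/bin/python2.7"))
```

With no manual choice the model returns the python3.5 path, because 2 is the highest priority; passing a manual choice mirrors what --config does.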
  
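  2.6 Another way to switch versions

Switch Ubuntu from Python 2 to Python 3

echo "alias python=python3" >> ~/.bashrc
source ~/.bashrc
Switch Ubuntu from Python 3 back to Python 2

gedit ~/.bashrc
Change the line alias python=python3 to alias python=python2 (or delete it and open a new terminal)
source ~/.bashrc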

To install another Python version, for example 3.7

sudo add-apt-repository ppa:jonathonf/python-3.7

sudo apt-get update

sudo apt-get install python3.7

Change the default version

sudo rm /usr/bin/python
sudo ln -s /usr/bin/python3.7 /usr/bin/python

4. Install pip
4.1 Add the following package sources to /etc/apt/sources.list

deb http://cn.archive.ubuntu.com/ubuntu bionic main multiverse restricted universe
deb http://cn.archive.ubuntu.com/ubuntu bionic-updates main multiverse restricted universe
deb http://cn.archive.ubuntu.com/ubuntu bionic-security main multiverse restricted universe
deb http://cn.archive.ubuntu.com/ubuntu bionic-proposed main multiverse restricted universe

4.2 sudo apt-get update
4.3 Install pip3: sudo apt-get install python3-pip

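Selenium
1. Install pip first: sudo apt-get install python3-pip
2. pip3 install selenium==2.53.6
Note: the full script later in this post passes options= to webdriver.Chrome. As far as I know, that keyword was only added in Selenium 3.8, so with 2.53.6 you may need chrome_options= instead, or a newer 3.x release.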
PyCharm
1. Choose PyCharm 2018.3.2 (the version matters; the 2019 releases cannot be activated)
Download PyCharm
Extract the archive after downloading

2.下载破解JAR包
百度云盘
提取码：8j3n
将下载下来的jar包放在pycharm-2018.3.2/bin目录里面

3.修改pycharm.vmoptions文件
pycharm-2018.3.2/bin下的pycharm.vomoptions和pycharm64.vmoptions
最后一行增加-javaagent:/opt/pycharm-2019.2.2/bin/JetbrainsIdesCrack-3.4-release-enc.jar
-javaagent:JetbrainsIdesCrack-3.4-release-enc.jar完整路径

4.启动pycharm
Activation code 随便写点东西，提交，激活成功

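II. CSV Files

CSV (Comma-Separated Values, sometimes called character-separated values) files store tabular data as plain text. The file is a sequence of characters made up of any number of records, separated by line breaks. Each record consists of fields, and fields are separated by a delimiter: usually a comma, though another character or string can be used. Every record has the same sequence of fields, so the file is effectively a plain-text form of a structured table.
CSV files can be opened with a text editor, Excel, or similar tools.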
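As a small illustration of the record and field structure, here is a sketch using Python's standard csv module with a semicolon as the delimiter. The rows are made-up sample data, and writing to an in-memory buffer instead of a file is only to keep the example self-contained.

```python
import csv
import io

# Made-up sample records: a header row plus one data row.
rows = [
    ["company", "region", "revenue"],
    ["Example Garment Co., Ltd.", "Guangdong", "unknown"],
]

# Write the records to an in-memory text buffer, one record per line,
# with ";" between fields.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";")
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Read the same text back into a list of records.
parsed = list(csv.reader(io.StringIO(text), delimiter=";"))
print(parsed == rows)
```

Because the delimiter is a semicolon, the comma inside the company name does not split the field; the csv module would also quote a field automatically if it contained the delimiter itself.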

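III. Scrapy Installation Error (unrelated to the rest of this post: I originally planned to use the Scrapy framework, but ended up building the scraper with Selenium)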

UserWarning: You do not have a working installation of the service_identity module: 'cannot import name 'opentype''. Please install it from https://pypi.python.org/pypi/service_identity and make sure all of its dependencies are satisfied. Without the service_identity module, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.

Running this command fixed it for me: sudo pip3 install --upgrade google-auth-oauthlib

IV. The Scraper

Use the Chrome browser
Press F12 to open the developer tools and select the Elements tab

Use the element picker to select elements on the page

Common Selenium methods

Method	Purpose
browser = webdriver.Chrome(executable_path=chromedriver_path)	Creates a browser object; executable_path is the path to chromedriver
browser.get(url)	Opens the page at url
browser.find_elements_by_css_selector("div.item-main")	Selects all elements matching the selector, returned as a list
browser.find_element_by_css_selector(".title.ellipsis>a")	Selects a single element
browser.find_element_by_css_selector(".record.util-clearfix div.num").text	Gets the element's text content
browser.find_element_by_css_selector("").get_attribute("")	Gets an attribute of the element
browser.find_element_by_id("tag")	Selects an element by id (pass the bare id, without "#")
browser.find_element_by_name("password")	Selects an element by its name attribute
browser.current_window_handle	Handle of the current window
browser.window_handles	Handles of all open windows
browser.switch_to.window(thisHandle)	Switches to the window thisHandle
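The three window-handle entries in the table are typically used together. The sketch below shows one way to do that; switch_to_new_window is a hypothetical helper, and make_fake_browser is a stand-in object invented here so the example runs without a real browser. With Selenium you would pass the real browser object instead.

```python
from types import SimpleNamespace


def switch_to_new_window(browser, original_handle):
    """Switch to the first window whose handle differs from original_handle.

    Returns the handle switched to, or None if no other window is open.
    """
    for handle in browser.window_handles:
        if handle != original_handle:
            browser.switch_to.window(handle)
            return handle
    return None


def make_fake_browser(handles):
    """Stand-in exposing only the three attributes used above (not Selenium)."""
    fake = SimpleNamespace(window_handles=list(handles),
                           current_window_handle=handles[0])

    def window(handle):
        fake.current_window_handle = handle

    fake.switch_to = SimpleNamespace(window=window)
    return fake


fake = make_fake_browser(["main", "popup"])
switch_to_new_window(fake, fake.current_window_handle)
print(fake.current_window_handle)
```

Remembering the original handle before a link opens a new window makes it easy to switch back afterwards with browser.switch_to.window(original_handle).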

Full code

# -*- coding: utf-8 -*-

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

fo = open("clothesINFO.csv", "w", encoding='utf8')  # output file for the scraped data

chromedriver_path = "/usr/local/bin/chromedriver"

options = webdriver.ChromeOptions()  # configure two options
options.add_experimental_option('excludeSwitches',
                                ['enable-automation'])  # important: drops the enable-automation switch so sites are less likely to detect Selenium
options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})  # do not load images, which speeds up page loading

browser = webdriver.Chrome(executable_path=chromedriver_path, options=options)


Alibaba_url = 'https://www.alibaba.com/trade/search?fsb=y&IndexArea=company_en&CatId=&SearchText=children+clothes'

browser.get(Alibaba_url)  # open the search results page

for num in range(1, 58):    # range of result pages to scrape
    goodsDiv = browser.find_elements_by_css_selector("div.item-main")   # collection of item divs, one per listing
    next_button = browser.find_element_by_css_selector('a.next')        # "next page" link, used for pagination

    for single in goodsDiv:
        available = False
        title = single.find_element_by_css_selector(".title.ellipsis>a").text  # company name
        titleSplit = title.split(" ", 1)        # the prefix of the company name is the region
        region = titleSplit[0]

        # Not every listing has every field, so optional elements are fetched inside try/except blocks

        # total revenue
        try:
            revenue = single.find_element_by_css_selector(".record.util-clearfix div.num").text
        except NoSuchElementException:
            revenue = "unknown"

        if "%" in revenue:      # a value containing "%" is not a revenue figure
            revenue = "unknown"

        # main products
        try:
            mainProducts = single.find_element_by_css_selector(".value.ellipsis.ph").text
        except NoSuchElementException:
            mainProducts = "None"

        # main markets
        markets = single.find_elements_by_css_selector(".ellipsis.search")
        marketList = []
        for m in markets:
            if 'China' in m.text:    # only keep listings for China
                available = True
            elif "Million" in m.text:
                pass
            else:
                marketList.append(m.text)

        # transaction count
        try:
            transaction = single.find_element_by_css_selector(".record.util-clearfix div.lab b").text
        except NoSuchElementException:
            transaction = "0"

        if available:
            print(title, region, mainProducts, marketList, transaction, revenue)
            market = " ".join(marketList)
            # one semicolon-separated record per company
            fo.write(";".join([title, region, mainProducts, market, transaction, revenue]) + "\n")
            fo.flush()    # flush the record to the CSV file

    print("Page {} scraped successfully".format(num))
    next_button.click()

fo.close()
browser.quit()
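The field cleanup inside the loop can be pulled out into plain functions, which makes it testable without a browser. The sketch below restates that logic; the function names and the sample strings are illustrative assumptions, not values taken from the real page.

```python
def split_region(title):
    """Return the first word of the company name, used as the region."""
    return title.split(" ", 1)[0]


def clean_revenue(text):
    """Treat a missing value, or one containing '%', as unknown revenue."""
    if text is None or "%" in text:
        return "unknown"
    return text


def filter_markets(texts):
    """Split the '.ellipsis.search' texts into (available, market list).

    Mirrors the loop in the script: 'China' marks the listing as available,
    entries containing 'Million' are skipped, everything else is kept.
    """
    available = False
    markets = []
    for text in texts:
        if "China" in text:
            available = True
        elif "Million" in text:
            continue
        else:
            markets.append(text)
    return available, markets


# Hypothetical sample values for illustration only.
print(split_region("Guangzhou Example Clothing Co., Ltd."))
print(clean_revenue("85.7%"))
print(filter_markets(["China (Mainland)", "US$1 Million - US$2.5 Million", "North America 30%"]))
```

Keeping the parsing separate from the Selenium calls also makes it easier to adjust the rules if the page layout changes.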

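V. Data Post-Processing

Post-processing has three goals:

Add a number column after the company name to distinguish companies with identical names, and strip the "+" suffix from the total revenue column
Move the main products into a separate sheet for database import
Move the market distribution into a separate sheet for database import, converting percentages to decimals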
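The three transformations can be sketched as small Python functions. This is not the original post-processing program: the function names are invented, and the input formats (a trailing "+" on revenue, values like "25%" for market share) are assumptions based on the description above.

```python
from collections import defaultdict


def number_duplicates(names):
    """Return a running number per name: 1 for its first occurrence, 2 for the next, and so on."""
    seen = defaultdict(int)
    numbers = []
    for name in names:
        seen[name] += 1
        numbers.append(seen[name])
    return numbers


def strip_plus(revenue):
    """Remove a trailing '+' from a revenue string, if present."""
    return revenue[:-1] if revenue.endswith("+") else revenue


def percent_to_decimal(text):
    """Convert a percentage string such as '25%' to a decimal fraction."""
    return float(text.strip().rstrip("%")) / 100


# Made-up sample values for illustration only.
print(number_duplicates(["A Co.", "B Co.", "A Co."]))
print(strip_plus("US$1,000,000+"))
print(percent_to_decimal("25%"))
```

The number column together with the company name then forms a unique key, which is what the separate products and market sheets need in order to refer back to the master sheet.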

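Before processing: a single master sheet

After processing: master sheet

After processing: products sheet

After processing: market distribution sheet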

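xlrd is used to read the Excel workbook
xlwt is used to write the Excel workbook

I wrote my program by following this author's blog post. It is well written, so I am simply linking it here:
Reading and Modifying Excel Files with Python (利用Python读取和修改Excel文件)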