用Python大神Kennethreitz新框架request-html爬妹纸图遇坑记

  1. 不了解用法的可以去了解用法 代码传送门

  2. 这个网站不小心点开的,感觉大家应该都会喜欢,下载图片这块要仔细,其他地方按部就搬

    • 一般下载图片都是用

        req = request.Request(url, headers=headers)  
        f.write(request.urlopen(req).read())  
      

    headers中包含了Use-Agent或者Referer,主要是模拟浏览器操作,Referer一定得加上,告诉服务器我是从妹纸首页进来的,否则下载的都是腾讯QQ图片

  3. 直接上代码吧

    有几千张图片,下载会花点时间,对了,之前都是import session,然后再session.get(''),现在github上改成了import HTMLSession(),然后在get,这个网站图片总共就12页,然后第一页的网址和格式后面不同,没有后缀

     #coding=utf-8  
     """ 
     @author:Jianxiong Rao 
     @data:2018/3/12 
     @version:Python3.6 
     """  
     from requests_html import HTMLSession  
     import os  
     import time  
       
     class MM(object):  
         def __init__(self):  
             self.__page = 1  
             self.__url = "http://www.mm131.com/qingchun/list_1_{}.html"  
             self.__session = HTMLSession()  
             self.__headers = {  
                 'Referer':'http://www.mm131.com/qingchun/',  
                 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'  
             }  
             self.__imagePath = r'D:/Photo/MM'  
             self.__confirmPath()  
       
         def __confirmPath(self):  
             if not os.path.exists(self.__imagePath):  
                 os.makedirs(self.__imagePath)  
                   
         def download(self,link,fileName):  
             try:  
                 with open(self.__imagePath+'/'+fileName+'.jpg','wb') as f:  
               
                     f.write(self.__session.request('get',link,headers = self.__headers,allow_redirects=False).content)  
             except Exception as e:  
                 print(str(e))  
       
         def parseData(self):  
             start = time.time()
             while self.__page < 12:  
                 if self.__page == 1:  
                     self.__url = "http://www.mm131.com/qingchun/"  
                 else:  
                     self.__url = 'http://www.mm131.com/qingchun/list_1_{}.html'.format(self.__page)  
                 r = self.__session.get(self.__url)  
                 main = r.html.find(".main",first=True)  
                 dl = main.find('dl')[0]  
                 dds = dl.find('dd')  
                 for dd in dds[:-2]:  
                     attr = dd.find('img')[0].attrs  
                     imageLink = attr['src']  
                     title = attr['alt']  
                     self.download(imageLink,title)  
                 self.__page += 1
             end = time.time() - start
             print("爬取时间:",end)  
       
     if __name__=="__main__":  
         mm = MM()  
         mm.parseData()  
    

由于图片不太好,自己多动手运行,测试

推荐阅读更多精彩内容