Python 3 urllib: Basic Crawler Usage

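urllib provides a set of functions for working with URLs.
The request module of urllib makes it easy to fetch the content of a URL: it sends a GET request to the specified page and returns the HTTP response.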

01 Basic usage

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request

request_url = 'http://www.baidu.com'           # URL to request
response = urllib.request.urlopen(request_url) # send the request
print(response.read().decode('utf-8'))         # print the response body, decoded as UTF-8

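read(), readline(), readlines(), fileno(), close(): operate on the HTTPResponse object
info(): returns an HTTPMessage object holding the headers sent by the remote server
getcode(): returns the HTTP status code; for an HTTP request, 200 means the request succeeded, 404 means the page was not found, and so on
geturl(): returns the URL of the resource actually retrieved, which can differ from the requested URL after a redirect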
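
To see these methods in action without depending on an external site, here is a minimal sketch that starts a throwaway local HTTP server and queries it. The HelloHandler class, the response text, and the local address are assumptions made for this illustration; they are not part of the original examples. An empty ProxyHandler is used so that proxy environment variables do not interfere with the local request.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler, used only so the example needs no external site."""

    def do_GET(self):
        body = 'hello urllib'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the operating system pick a free port
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_address[1]

# An empty ProxyHandler ignores any proxy set in the environment
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(url, timeout=5) as response:
    status = response.getcode()                         # HTTP status code
    final_url = response.geturl()                       # URL actually retrieved
    content_type = response.info().get('Content-Type')  # header lookup on the HTTPMessage
    body = response.read().decode('utf-8')              # response body as text

server.shutdown()
server.server_close()
print(status, final_url, content_type, body)
```

Since Python 3.9 the same values are also available as the attributes response.status, response.url and response.headers, which the documentation recommends over getcode(), geturl() and info().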

02 GET method

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib.request
import urllib.parse

get_data = {'username': 'aaa', 'password': 'bbb'}          # GET parameters, defined as a dict
get_data_encode = urllib.parse.urlencode(get_data)         # URL-encode the GET parameters

request_url = 'http://www.baidu.com'                       # URL to request
request_url += '?' + get_data_encode                       # append the GET parameters to the URL

# http://www.baidu.com?username=aaa&password=bbb
print(request_url)

# Send the request
response = urllib.request.urlopen(request_url)
print(response.read().decode('utf-8'))         # print the response body, decoded as UTF-8
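
The reason for calling urlencode instead of joining the strings by hand becomes clearer with values that contain spaces, non-ASCII text, or reserved characters such as &. The following sketch makes no network request; the parameter values are made up for illustration. It encodes a dict and then parses the query string back to show that the round trip is lossless.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Illustrative parameters: a space, non-ASCII text, and a reserved character
params = {'username': 'aaa', 'kd': 'Python 爬虫', 'tag': 'a&b'}

query = urlencode(params)                   # percent-encodes each value as UTF-8
url = 'http://www.baidu.com/?' + query
print(url)

# Parse the query string back; parse_qs returns a list of values per key
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

Note that parse_qs maps every key to a list, because a query string may repeat the same key several times.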

03 GET and reading the response headers

from urllib import request

with request.urlopen('http://www.baidu.com') as f:
    data = f.read()
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))

04 POST method

#!/usr/bin/env python
# -*- coding: utf-8 -*-
 

import urllib.request
import urllib.parse

post_data = {'first': 'true', 'pn': 1, 'kd': 'Python'}      # POST data, defined as a dict
post_data_encode = urllib.parse.urlencode(post_data)        # URL-encode the POST data

# Encode to UTF-8 bytes
# Otherwise a TypeError is raised: POST data should be bytes or an iterable of bytes. It cannot be of type str.
post_data_encode = post_data_encode.encode(encoding='utf-8')
request_url = 'http://www.lagou.com/jobs/positionAjax.json?'               # URL to request

# Send the request
# The second argument is the POST data to send (default: None)
# The third argument is the request timeout in seconds (default: socket._GLOBAL_DEFAULT_TIMEOUT)
response = urllib.request.urlopen(request_url, post_data_encode, 3)
print(response.read().decode('utf-8'))         # print the response body, decoded as UTF-8

# A second example: simulate a mobile login to weibo.cn by sending form data with custom headers
from urllib import request, parse

print('Login to weibo.cn...')
email = input('Email: ')
passwd = input('Password: ')
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', '1'),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])

req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'https://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))
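
Both examples above depend on remote sites that may change or reject the request. As a self-contained sketch of the same technique, the code below POSTs the form data to a local echo server and reads it back. The EchoHandler class and the /echo path are hypothetical stand-ins, not part of the original examples. It also shows that passing data switches the request method from GET to POST.

```python
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in server: replies with the form body it received."""

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        received = self.rfile.read(length)
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(received)))
        self.end_headers()
        self.wfile.write(received)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(('127.0.0.1', 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/echo' % server.server_address[1]

post_data = {'first': 'true', 'pn': 1, 'kd': 'Python'}
post_bytes = urllib.parse.urlencode(post_data).encode('utf-8')  # data must be bytes

req = urllib.request.Request(url, data=post_bytes)
method = req.get_method()  # 'POST', because data is not None

# An empty ProxyHandler ignores any proxy set in the environment
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(req, timeout=3) as response:
    echoed = urllib.parse.parse_qs(response.read().decode('utf-8'))

server.shutdown()
server.server_close()
print(method, echoed)
```

The integer 1 comes back as the string '1': urlencode converts every value to text before sending it.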

05 Setting headers with Request

#!/usr/bin/env python
# -*- coding: utf-8 -*-
 
import urllib.request
import urllib.parse

user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87'
referer = 'http://www.lagou.com/jobs/positionAjax.json?'
post_data = {'first': 'true', 'pn': 1, 'kd': 'Python'}                              # POST data, defined as a dict
headers = {'User-Agent': user_agent, 'Referer': referer}                            # request headers
post_data_encode = urllib.parse.urlencode(post_data)                                # URL-encode the POST data

# Encode to UTF-8 bytes
# Otherwise a TypeError is raised: POST data should be bytes or an iterable of bytes. It cannot be of type str.
post_data_encode = post_data_encode.encode(encoding='utf-8')
request_url = 'http://www.lagou.com/zhaopin/Python/?labelWords=label'               # URL to request

# Use a Request object to set the headers
request = urllib.request.Request(request_url, post_data_encode, headers)

response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))         # print the response body, decoded as UTF-8
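
Before sending anything, a Request object can be inspected to check which headers it will carry. The sketch below makes no network request; the shortened User-Agent string and the extra Accept-Language header are assumptions added for illustration. One detail worth knowing is that Request stores header names in capitalized form, so 'User-Agent' is kept as 'User-agent'.

```python
import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0 (example)',  # shortened, illustrative value
    'Referer': 'http://www.lagou.com/jobs/positionAjax.json?',
}
req = urllib.request.Request(
    'http://www.lagou.com/zhaopin/Python/?labelWords=label',
    headers=headers,
)
req.add_header('Accept-Language', 'zh-CN')  # headers can also be added one at a time

stored = dict(req.header_items())           # header names as stored on the Request
user_agent = req.get_header('User-agent')   # lookup uses the capitalized form
missing = req.get_header('User-Agent')      # None: the name is not stored in this form
method = req.get_method()                   # 'GET', because no data was passed

print(sorted(stored))
print(user_agent, missing, method)
```

The capitalization only affects lookups on the Request object; HTTP header names are case-insensitive on the wire.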

06 Proxy settings

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from urllib import request

request_url = 'http://www.xmgc360.com/project/test.php'
proxy = request.ProxyHandler({'http': '119.28.54.102:3389'})   # configure the proxy server
opener = request.build_opener(proxy)                            # build an opener that uses the proxy
request.install_opener(opener)                                  # install it as the global default opener
response = request.urlopen(request_url)
print(response.read().decode('utf-8'))         # print the response body, decoded as UTF-8
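
Public proxy addresses tend to stop working quickly, so here is a sketch that shows what ProxyHandler does using a local stand-in. FakeProxyHandler is a hypothetical server that only records the request target and returns a fixed reply; it does not forward anything, and example.invalid is a placeholder host that is never contacted. The sketch also calls opener.open() directly instead of install_opener(), which keeps the proxy local to this opener rather than changing the behaviour of every later urlopen() call.

```python
import threading
from urllib import request
from http.server import BaseHTTPRequestHandler, HTTPServer

seen_targets = []  # request targets received by the stand-in proxy


class FakeProxyHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a proxy: records the target, returns a fixed reply."""

    def do_GET(self):
        seen_targets.append(self.path)  # a proxied request carries the full URL
        body = b'via proxy'
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(('127.0.0.1', 0), FakeProxyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
proxy_address = 'http://127.0.0.1:%d' % server.server_address[1]

proxy = request.ProxyHandler({'http': proxy_address})  # route http:// URLs through the proxy
opener = request.build_opener(proxy)                   # not installed globally

with opener.open('http://example.invalid/page', timeout=3) as response:
    body = response.read().decode('utf-8')

server.shutdown()
server.server_close()
print(body, seen_targets)
```

The recorded target is the absolute URL rather than just /page, which is how an HTTP client tells a proxy where to forward the request.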

07 Exception handling

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from urllib import request

request_url = 'http://www.lagou.com/jobs/positionAjax.json?'
proxy = request.ProxyHandler({'http': '127.0.0.1:8989'}) # configure the proxy server
opener = request.build_opener(proxy)                         # build an opener that uses the proxy
request.install_opener(opener)                               # install it as the global default opener
try:
    response = request.urlopen(request_url)
except Exception as e:
    print(e)                   # print the error message
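
Catching Exception works, but urllib.error offers two more specific classes: HTTPError for responses with an error status code, and URLError for failures to reach the server at all. The sketch below triggers both against a local server; NotFoundHandler and the fetch() helper are hypothetical names introduced for this illustration. HTTPError is a subclass of URLError, so it has to be caught first.

```python
import threading
from urllib import error, request
from http.server import BaseHTTPRequestHandler, HTTPServer


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def fetch(opener, url):
    """Return a short description of what happened when requesting url."""
    try:
        with opener.open(url, timeout=3) as response:
            return 'ok %d' % response.status
    except error.HTTPError as e:   # the server answered with an error status
        return 'HTTPError %d' % e.code
    except error.URLError as e:    # the server could not be reached
        return 'URLError %s' % type(e.reason).__name__


server = HTTPServer(('127.0.0.1', 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_address[1]

# An empty ProxyHandler ignores any proxy set in the environment
opener = request.build_opener(request.ProxyHandler({}))

while_running = fetch(opener, url)  # server is up and returns 404

server.shutdown()
server.server_close()
after_close = fetch(opener, url)    # nothing listens on the port any more

print(while_running)
print(after_close)
```

HTTPError exposes the status as e.code, while URLError carries the underlying cause in e.reason.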

08 Exercise

http://image.baidu.com/channel/listjson?pn=1&rn=30&tag1=%E6%98%8E%E6%98%9F&tag2=%E5%85%A8%E9%83%A8&ie=utf8

Fetch the data from the URL above and save it to a database.
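
One possible starting point for the storage half of the exercise is sketched below, using the sqlite3 module from the standard library. Everything about the data here is an assumption: sample_body is a made-up payload, and the field names id, desc and image_url are hypothetical, so check the JSON the URL actually returns and adjust them. In a real solution the text passed to save_items() would come from response.read().decode('utf-8').

```python
import json
import sqlite3

# Made-up sample standing in for the JSON body returned by the server
sample_body = json.dumps({
    'data': [
        {'id': '1', 'desc': 'first', 'image_url': 'http://example.invalid/1.jpg'},
        {'id': '2', 'desc': 'second', 'image_url': 'http://example.invalid/2.jpg'},
        {},  # an empty entry, skipped below
    ]
})


def save_items(conn, body_text):
    """Parse the JSON text and store one row per item; return the number saved."""
    items = json.loads(body_text).get('data', [])
    conn.execute(
        'CREATE TABLE IF NOT EXISTS images '
        '(id TEXT PRIMARY KEY, description TEXT, image_url TEXT)'
    )
    rows = [
        (item['id'], item.get('desc', ''), item.get('image_url', ''))
        for item in items
        if item.get('id')  # skip entries without an id
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany('INSERT OR REPLACE INTO images VALUES (?, ?, ?)', rows)
    return len(rows)


conn = sqlite3.connect(':memory:')  # use a file path instead to keep the data
saved = save_items(conn, sample_body)
count = conn.execute('SELECT COUNT(*) FROM images').fetchone()[0]
conn.close()
print(saved, count)
```

INSERT OR REPLACE keyed on id means running the crawler twice does not create duplicate rows.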

Last edited: 2018.03.15 14:35:51
