Web Scraping Study Notes (1) -- urllib Summary

Basics:

1. URL (Uniform Resource Locator): the address of a standard resource on the Internet, commonly called a "web address".

2. In Python 3.x there is no longer a urllib2 library; there is only the single urllib library.

3. URL encoding is also called percent-encoding.

4. urllib2 from Python 2.7 corresponds to urllib.request in Python 3,

and robotparser became a module inside the urllib package (urllib.robotparser).


According to the official documentation, urllib is a package for working with URLs.

It contains four modules:

1. urllib.request is used to open and read URLs.

     1.1. The urlopen function is the usual way to open a URL.

     1.2. Building an opener with the build_opener function is the advanced way to open pages.

2. urllib.error contains the exceptions raised while running urllib.request.

3. urllib.parse is used to parse URLs.

4. urllib.robotparser is used to parse robots.txt files.
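Item 1.2 mentions build_opener(); a minimal sketch of constructing an opener with a custom User-Agent header (the header value is an illustrative assumption, not a required string):

```python
from urllib import request

# Build an opener; extra handlers (e.g. HTTPCookieProcessor) could be
# passed to build_opener() as arguments.
opener = request.build_opener()

# addheaders is a list of (name, value) pairs sent with every request.
opener.addheaders = [('User-agent', 'MyCrawler/0.1')]  # hypothetical UA string

# opener.open(url) would then fetch a page the same way urlopen() does;
# request.install_opener(opener) would make it the default for urlopen().
```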



I. Common functions in urllib.request

urllib.request.urlopen(url, data=None, [timeout,], cafile=None, capath=None, cadefault=False, context=None)

1. The urllib.request module uses HTTP/1.1 and includes a Connection: close header in its HTTP requests.

2. The optional timeout parameter specifies a timeout in seconds for blocking operations such as the connection attempt; if the connection has not been made after timeout seconds, a timeout exception is raised. If it is not set, the global default timeout is used.

3. For HTTP and HTTPS URLs, this function returns an http.client.HTTPResponse object (slightly modified), which has the following methods:

-   it is a file-like object, so the usual file methods can be used (read, readline, fileno, close)

-   geturl(): returns the URL of the resource retrieved

-   getcode(): returns the HTTP status code of the response; 200 means the request succeeded, 404 means the requested resource was not found

-   info(): returns an http.client.HTTPMessage object holding the headers sent by the remote server
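A short sketch of these response methods; to stay self-contained it opens a data: URL (which urlopen also handles) rather than a live site, but the same methods apply to an http:// response:

```python
from urllib import request

# urlopen() accepts data: URLs as well as http/https; using one here
# avoids needing network access for the demonstration.
response = request.urlopen('data:text/plain,Hello')

body = response.read()      # file-like: returns the body as bytes
url = response.geturl()     # the URL that was actually retrieved
headers = response.info()   # the response headers (a message object)
response.close()

print(body)                 # b'Hello'
# For an HTTP response, response.getcode() would return the status code,
# e.g. 200; data: URLs carry no status code.
```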


II. Common functions in urllib.parse:

1. urllib.parse.urlparse(url, scheme='', allow_fragments=True):

- parses a URL into its 6 components

- returns a 6-element tuple (scheme, netloc, path, params, query, fragment), an urllib.parse.ParseResult object

whose attributes correspond to these 6 components

eg:

>>> from urllib import parse

>>> url = r'https://docs.python.org/3.5/search.html?q=parse&check_keywords=yes&area=default'

>>> parseResult = parse.urlparse(url)

>>> parseResult  # parse the address into its components

ParseResult(scheme='https', netloc='docs.python.org', path='/3.5/search.html', params='', query='q=parse&check_keywords=yes&area=default', fragment='')

>>> parseResult.query

'q=parse&check_keywords=yes&area=default'

The output makes the meaning clear.


2. urllib.parse.urlunparse(tuple)

- the inverse of urlparse

- takes a 6-element tuple as input and returns the complete URL
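For instance, round-tripping the search URL from the urlparse example above:

```python
from urllib import parse

url = 'https://docs.python.org/3.5/search.html?q=parse&check_keywords=yes&area=default'
parts = parse.urlparse(url)          # split into the 6-tuple of components
rebuilt = parse.urlunparse(parts)    # inverse operation: reassemble the URL
print(rebuilt)                       # identical to the original URL
```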


3. urllib.parse.urljoin

urljoin(base, url, allow_fragments=True)

        Join a base URL and a possibly relative URL to form an absolute

        interpretation of the latter.

- base is the base address of the URL

- base is combined with the relative address in the second argument to form an absolute URL


eg:

>>> scheme = 'http'

>>> netloc = 'www.python.org'

>>> path = 'lib/module-urlparse.html'

>>> modlist = ('urllib', 'urllib2', 'httplib')

>>> unparsed_url = parse.urlunparse((scheme, netloc, path, '', '', ''))

>>> unparsed_url

'http://www.python.org/lib/module-urlparse.html'

>>> for mod in modlist:
...     url = parse.urljoin(unparsed_url, 'module-%s.html' % mod)
...     print(url)
...

# the replacement starts after the last "/"

http://www.python.org/lib/module-urllib.html

http://www.python.org/lib/module-urllib2.html

http://www.python.org/lib/module-httplib.html

>>>
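A few more urljoin() behaviors worth knowing: a second argument starting with '/' replaces the whole path, and an absolute URL in the second argument wins outright:

```python
from urllib import parse

base = 'http://www.python.org/lib/module-urlparse.html'

# relative name: replaces everything after the last '/'
print(parse.urljoin(base, 'module-urllib.html'))
# → 'http://www.python.org/lib/module-urllib.html'

# leading '/': replaces the entire path
print(parse.urljoin(base, '/doc/index.html'))
# → 'http://www.python.org/doc/index.html'

# absolute URL: the base is ignored
print(parse.urljoin(base, 'https://docs.python.org/3/'))
# → 'https://docs.python.org/3/'
```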


4. urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace'):

- parses a query given as a string argument

- qs: a percent-encoded query string (e.g. from a GET request)

- returns a dict mapping each query parameter to a list of its values

eg:

Continuing from the urlparse example above,

>>> param_dict = parse.parse_qs(parseResult.query)

>>> param_dict

{'area': ['default'], 'check_keywords': ['yes'], 'q': ['parse']}
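The keep_blank_values parameter controls whether parameters with empty values survive; by default they are dropped:

```python
from urllib import parse

qs = 'q=parse&empty='

print(parse.parse_qs(qs))
# → {'q': ['parse']}                   blank value dropped by default

print(parse.parse_qs(qs, keep_blank_values=True))
# → {'q': ['parse'], 'empty': ['']}    blank value kept as an empty string
```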


5. urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=<function quote_plus at 0x0365CC90>)

# merge the query parameters and percent-encode them

>>> from urllib import parse

>>> query = {'name': 'walker', 'age': 99}

>>> parse.urlencode(query)

'name=walker&age=99'
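When a value is itself a sequence, doseq=True expands it into one parameter per element (otherwise the whole sequence is encoded as its str() representation):

```python
from urllib import parse

query = {'tag': ['python', 'urllib']}

print(parse.urlencode(query, doseq=True))
# → 'tag=python&tag=urllib'
```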


Summary:

Functions 1 and 2 handle the URL as a whole, splitting it apart and putting it back together.

Functions 4 and 5 handle the query component of a URL.


6. urllib.parse.quote(string, safe='/', encoding=None, errors=None)

# percent-encode a string

1. If a URL contains Chinese characters, encode the Chinese part first (e.g. convert it from GBK to UTF-8) and pass it through quote() before using the URL; otherwise the encoding will be wrong and the URL cannot be opened normally.

2. Likewise, to extract a Chinese field from a URL, unquote() it first and then decode it, using gbk or utf-8 as the situation requires.

eg:

>>> from urllib import parse

>>> parse.quote('a&b/c')  # the slash is not encoded

'a%26b/c'

>>> parse.quote_plus('a&b/c')  # the slash is encoded too

'a%26b%2Fc'
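The encoding point above can be seen directly; in Python 3, quote() takes an encoding parameter, so the same characters percent-encode differently under UTF-8 and GBK:

```python
from urllib import parse

s = '中文'

utf8_quoted = parse.quote(s)                 # default encoding='utf-8'
gbk_quoted = parse.quote(s, encoding='gbk')

print(utf8_quoted)   # '%E4%B8%AD%E6%96%87'
print(gbk_quoted)    # '%D6%D0%CE%C4'

# decoding must match the encoding that was used when quoting
print(parse.unquote(utf8_quoted))                 # '中文'
print(parse.unquote(gbk_quoted, encoding='gbk'))  # '中文'
```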

7. unquote(string, encoding='utf-8', errors='replace')

>>> parse.unquote('1+2')

'1+2'

>>> parse.unquote_plus('1+2')

'1 2'


III. urllib.robotparser

Used to parse robots.txt files and check whether a site allows a given crawler.

eg:

>>> from urllib import robotparser

>>> rp = robotparser.RobotFileParser()

>>> rp.set_url('http://example.webscraping.com/robots.txt')  # point the parser at the robots.txt file

>>> rp.read()  # download and parse it

>>> url = 'http://example.webscraping.com'

>>> user_agent = 'GoodCrawler'

>>> rp.can_fetch(user_agent, url)

True
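The parser can also be fed robots.txt rules directly with parse(), which makes it easy to try out the matching logic without a network fetch (the rules and user agent below are made up for illustration):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# parse() takes the file's lines directly; normally rp.read() downloads them
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

print(rp.can_fetch('GoodCrawler', 'http://example.com/private/page.html'))  # False
print(rp.can_fetch('GoodCrawler', 'http://example.com/index.html'))         # True
```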


For details, see the function documentation below (the output of help(urllib.parse)):

FUNCTIONS

    parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')

        Parse a query given as a string argument.


      Arguments:


        qs: percent-encoded query string to be parsed


        keep_blank_values: flag indicating whether blank values in

            percent-encoded queries should be treated as blank strings.

            A true value indicates that blanks should be retained as

            blank strings.  The default false value indicates that

            blank values are to be ignored and treated as if they were

            not included.


        strict_parsing: flag indicating what to do with parsing errors.

            If false (the default), errors are silently ignored.

            If true, errors raise a ValueError exception.


        encoding and errors: specify how to decode percent-encoded sequences

            into Unicode characters, as accepted by the bytes.decode() method.


    parse_qsl(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')

        Parse a query given as a string argument.


        Arguments:


        qs: percent-encoded query string to be parsed


        keep_blank_values: flag indicating whether blank values in

            percent-encoded queries should be treated as blank strings.  A

            true value indicates that blanks should be retained as blank

            strings.  The default false value indicates that blank values

            are to be ignored and treated as if they were  not included.


        strict_parsing: flag indicating what to do with parsing errors. If

            false (the default), errors are silently ignored. If true,

            errors raise a ValueError exception.


        encoding and errors: specify how to decode percent-encoded sequences

            into Unicode characters, as accepted by the bytes.decode() method.


        Returns a list, as G-d intended.


    quote(string, safe='/', encoding=None, errors=None)

        quote('abc def') -> 'abc%20def'


        Each part of a URL, e.g. the path info, the query, etc., has a

        different set of reserved characters that must be quoted.


        RFC 2396 Uniform Resource Identifiers (URI): Generic Syntax lists

        the following reserved characters.


        reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |

                      "$" | ","


        Each of these characters is reserved in some component of a URL,

        but not necessarily in all of them.


        By default, the quote function is intended for quoting the path

        section of a URL.  Thus, it will not encode '/'.  This character

        is reserved, but in typical usage the quote function is being

        called on a path where the existing slash characters are used as

        reserved characters.


        string and safe may be either str or bytes objects. encoding and errors

        must not be specified if string is a bytes object.


        The optional encoding and errors parameters specify how to deal with

        non-ASCII characters, as accepted by the str.encode method.

        By default, encoding='utf-8' (characters are encoded with UTF-8), and

        errors='strict' (unsupported characters raise a UnicodeEncodeError).


    quote_from_bytes(bs, safe='/')

        Like quote(), but accepts a bytes object rather than a str, and does

        not perform string-to-bytes encoding.  It always returns an ASCII string.

        quote_from_bytes(b'abc def?') -> 'abc%20def%3f'


    quote_plus(string, safe='', encoding=None, errors=None)

        Like quote(), but also replace ' ' with '+', as required for quoting

        HTML form values. Plus signs in the original string are escaped unless

        they are included in safe. It also does not have safe default to '/'.


    unquote(string, encoding='utf-8', errors='replace')

        Replace %xx escapes by their single-character equivalent. The optional

        encoding and errors parameters specify how to decode percent-encoded

        sequences into Unicode characters, as accepted by the bytes.decode()

        method.

        By default, percent-encoded sequences are decoded with UTF-8, and invalid

        sequences are replaced by a placeholder character.


        unquote('abc%20def') -> 'abc def'.


    unquote_plus(string, encoding='utf-8', errors='replace')

        Like unquote(), but also replace plus signs by spaces, as required for

        unquoting HTML form values.


        unquote_plus('%7e/abc+def') -> '~/abc def'


    unquote_to_bytes(string)

        unquote_to_bytes('abc%20def') -> b'abc def'.


    urldefrag(url)

        Removes any existing fragment from URL.


        Returns a tuple of the defragmented URL and the fragment.  If

        the URL contained no fragments, the second element is the

        empty string.


    urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=<function quote_plus at 0x0365CC90>)

        Encode a dict or sequence of two-element tuples into a URL query string.


        If any values in the query arg are sequences and doseq is true, each

        sequence element is converted to a separate parameter.


        If the query arg is a sequence of two-element tuples, the order of the

        parameters in the output will match the order of parameters in the

        input.


        The components of a query arg may each be either a string or a bytes type.


        The safe, encoding, and errors parameters are passed down to the function

        specified by quote_via (encoding and errors only if a component is a str).


    urljoin(base, url, allow_fragments=True)

        Join a base URL and a possibly relative URL to form an absolute

        interpretation of the latter.


    urlparse(url, scheme='', allow_fragments=True)

        Parse a URL into 6 components:

        <scheme>://<netloc>/<path>;<params>?<query>#<fragment>

        Return a 6-tuple: (scheme, netloc, path, params, query, fragment).

        Note that we don't break the components up in smaller bits

        (e.g. netloc is a single string) and we don't expand % escapes.


    urlsplit(url, scheme='', allow_fragments=True)

        Parse a URL into 5 components:

        <scheme>://<netloc>/<path>?<query>#<fragment>

        Return a 5-tuple: (scheme, netloc, path, query, fragment).

        Note that we don't break the components up in smaller bits

        (e.g. netloc is a single string) and we don't expand % escapes.


    urlunparse(components)

        Put a parsed URL back together again.  This may result in a

        slightly different, but equivalent URL, if the URL that was parsed

        originally had redundant delimiters, e.g. a ? with an empty query

        (the draft states that these are equivalent).


    urlunsplit(components)

        Combine the elements of a tuple as returned by urlsplit() into a

        complete URL as a string. The data argument can be any five-item iterable.

        This may result in a slightly different, but equivalent URL, if the URL that

        was parsed originally had unnecessary delimiters (for example, a ? with an

        empty query; the RFC states that these are equivalent).


DATA

    __all__ = ['urlparse', 'urlunparse', 'urljoin', 'urldefrag', 'urlsplit...


FILE

    d:\python3\lib\urllib\parse.py
