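Introduction to Computer Science: Study Summary (Part 2)

Unit 4

Responding to Queries

In this unit we build a working search engine that crawls a set of web pages, builds an index of them, and answers keyword queries. The first step is indexing the web corpus. The index is a list in which each entry holds a keyword and the list of URLs where that keyword appears; a further field could later record how many pages contain the keyword.

An index with this structure can be built as follows: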

index = []

def add_to_index(index,keyword,url):
    for e in index:
        if e[0] == keyword:
            e[1].append(url)
            return
    index.append([keyword,[url]])

add_to_index(index,'udacity','http://udacity.com')
add_to_index(index,'computing','http://acm.org')
add_to_index(index,'udacity','http://npr.org')
print index
#>>> [['udacity', ['http://udacity.com', 'http://npr.org']], 
#>>> ['computing', ['http://acm.org']]]

Looking up a keyword:

index = [['udacity', ['http://udacity.com', 'http://npr.org']],
         ['computing', ['http://acm.org']]]

def lookup(index,keyword):
    for e in index:
        if e[0] == keyword:
            return e[1]
    return []

print lookup(index,'udacity')
#>>> ['http://udacity.com','http://npr.org']

Adding a whole page to the index:

index = []

def add_to_index(index,keyword,url):
    for entry in index:
        if entry[0] == keyword:
            if url not in entry[1]:       # record each URL only once per keyword
                entry[1].append(url)
            return
    index.append([keyword,[url]])

def add_page_to_index(index,url,content):
    words = content.split()               # rough split of the page content on whitespace
    for word in words:
        add_to_index(index,word,url)      # handles both new and existing keywords

add_page_to_index(index,'fake.text',"This is a test")
print index
#>>> [['This', ['fake.text']], ['is', ['fake.text']], ['a', ['fake.text']],
#>>> ['test',['fake.text']]]

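Then, in the crawl_web function defined in Unit 3, adding the line add_page_to_index(index, page, content) right after content is fetched is enough to build the index. In fact, the code given in Unit 3 was already the complete version with indexing included.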
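To make the placement of that line concrete, here is a minimal sketch of a crawler with indexing. It is not the Unit 3 listing: FAKE_WEB is a hypothetical in-memory stand-in for the network, and get_page and get_all_links are simplified helpers that read from it (here get_all_links takes a URL rather than page content).

```python
# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# It replaces real network access so the sketch runs anywhere.
FAKE_WEB = {
    'a.html': ('udacity teaches computing', ['b.html', 'c.html']),
    'b.html': ('computing is fun', ['a.html']),
    'c.html': ('udacity is online', []),
}

def get_page(url):
    # Simplified stand-in: returns the stored text, or '' for unknown URLs.
    return FAKE_WEB.get(url, ('', []))[0]

def get_all_links(url):
    # Simplified stand-in: returns the stored outgoing links.
    return FAKE_WEB.get(url, ('', []))[1]

def add_to_index(index, keyword, url):
    for entry in index:
        if entry[0] == keyword:
            if url not in entry[1]:
                entry[1].append(url)
            return
    index.append([keyword, [url]])

def add_page_to_index(index, url, content):
    for word in content.split():
        add_to_index(index, word, url)

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    index = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            content = get_page(page)
            add_page_to_index(index, page, content)   # the one added line
            for link in get_all_links(page):
                if link not in tocrawl:
                    tocrawl.append(link)
            crawled.append(page)
    return index

index = crawl_web('a.html')
```

After the crawl, every word of every reachable page has an entry in index, and each entry lists the pages containing that word.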

The unit then introduces the basic principles of the Internet, starting with the page-fetching function:

def get_page(url):
    try:
        import urllib
        return urllib.urlopen(url).read()
    except:
        return ''

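Definition of a network:

Two points: there are at least three entities, and two entities that are not directly connected can still communicate by having a third entity relay the message.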
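The relaying idea can be illustrated with a small sketch. The three-node network and the find_route helper below are hypothetical and only serve to show that A can reach C through B even though they share no direct link.

```python
from collections import deque

# Hypothetical network: A and C are not directly connected,
# but both are connected to B.
links = {'A': ['B'], 'B': ['A', 'C'], 'C': ['B']}

def find_route(links, source, target):
    """Breadth-first search. Returns the nodes a message passes through, or None."""
    queue = deque([[source]])
    visited = set([source])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbor in links.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

route = find_route(links, 'A', 'C')   # ['A', 'B', 'C']: B relays the message
```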

Latency and bandwidth:

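The built-in split function used earlier to break up page content has some problems (it only splits on whitespace, so punctuation stays attached to words). Here is a finer-grained splitting function: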

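def split_string(source,splitlist):
    output = []
    atsplit = True                # True when the previous character was a separator
    for e in source:
        if e in splitlist:           # splitlist holds the separator characters, e.g. punctuation
            atsplit = True
        else:
            if atsplit:                     # previous character was a separator: start a new word
                output.append(e)
                atsplit = False
            else:                        # previous character was not a separator: extend the last word
                output[-1] = output[-1] + e

    return output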

out = split_string("This is a test-of the,string separation-code!"," ,!-")
print out
#>>> ['This', 'is', 'a', 'test', 'of', 'the', 'string', 'separation', 'code']

out = split_string("After  the flood   ...  all the colors came out.", " .")
print out
#>>> ['After', 'the', 'flood', 'all', 'the', 'colors', 'came', 'out']

out = split_string("First Name,Last Name,Street Address,City,State,Zip Code",",")
print out
#>>>['First Name', 'Last Name', 'Street Address', 'City', 'State', 'Zip Code']
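For comparison, the same splitting can probably be done with the standard re module. This is an alternative sketch, not the course's solution; split_string_re is a name made up here, and it assumes splitlist is non-empty.

```python
import re

def split_string_re(source, splitlist):
    # Build a character class from the separators. re.escape protects
    # characters such as '-' or ']' that are special inside [...].
    pattern = '[' + re.escape(splitlist) + ']+'
    # re.split can leave empty strings at either end; drop them.
    return [piece for piece in re.split(pattern, source) if piece]

words = split_string_re("This is a test-of the,string separation-code!", " ,!-")
```

The hand-written loop is still worth reading for how it tracks state one character at a time; the regular expression just packages the same idea.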

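To count clicks, the index structure is changed so that each URL is stored together with its click count, i.e. each entry has the form [keyword, [[url, count], ...]]:

def record_user_click(index,keyword,url):
    urls = lookup(index, keyword)
    if urls:                     # the keyword is in the index
        for e in urls:           # e is [url, count]
            if e[0] == url:
                e[1] += 1
                return
    # keyword or URL not indexed yet: add it with a click count of 0
    add_to_index(index, keyword, url)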
def add_to_index(index, keyword, url):
    for entry in index:
        if entry[0] == keyword:
            for urls in entry[1]:               # skip URLs that are already recorded
                if urls[0] == url:
                    return
            entry[1].append([url,0])
            return
    # not found, add new keyword to index
    index.append([keyword, [[url,0]]])
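A short usage sketch of the click-counting index follows. The functions are restated so the block runs on its own; the URLs are simply the sample ones used earlier in these notes.

```python
def lookup(index, keyword):
    for entry in index:
        if entry[0] == keyword:
            return entry[1]
    return []

def add_to_index(index, keyword, url):
    for entry in index:
        if entry[0] == keyword:
            for urls in entry[1]:
                if urls[0] == url:
                    return               # URL already recorded for this keyword
            entry[1].append([url, 0])
            return
    index.append([keyword, [[url, 0]]])

def record_user_click(index, keyword, url):
    for e in lookup(index, keyword):     # e is [url, count]
        if e[0] == url:
            e[1] += 1
            return
    add_to_index(index, keyword, url)    # unseen keyword or URL starts at 0

index = []
add_to_index(index, 'udacity', 'http://udacity.com')
add_to_index(index, 'udacity', 'http://npr.org')
record_user_click(index, 'udacity', 'http://udacity.com')
record_user_click(index, 'udacity', 'http://udacity.com')
# index is now [['udacity', [['http://udacity.com', 2], ['http://npr.org', 0]]]]
```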

Counting the number of words in a passage:

def count_words(page):
    s = 0
    ok = True
    for e in page:
        if e == ' ':
            ok = True
        else:
            if ok:
                s += 1
                ok = False
    return s

passage =("The number of orderings of the 52 cards in a deck of cards "
"is so great that if every one of the almost 7 billion people alive "
"today dealt one ordering of the cards per second, it would take "
"2.5 * 10**40 times the age of the universe to order the cards in every "
"possible way.")
print count_words(passage)
#>>>56

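Unit 5

How Programs Run

To judge how good a program is, we first measure the resources it uses when it runs, that is, the time complexity and space complexity of the algorithm. The following code measures a program's running time in seconds: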

import time
def time_execution(code):
    start = time.clock()
    result = eval(code)
    run_time = time.clock() - start
    return result , run_time

def spin_loop(n):
    i = 0
    while i < n:
        i = i + 1
# time a loop that performs 1000 additions
time_execution('spin_loop(1000)')
=>
(None, 0.000136000000000025)
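The listing above targets Python 2. time.clock was deprecated in Python 3.3 and removed in Python 3.8, so on a current interpreter a sketch along the following lines, using time.perf_counter, may be needed. Passing a callable instead of a string for eval is a choice made for this sketch, not part of the course code.

```python
import time

def time_execution(func, *args):
    # perf_counter is a high-resolution clock intended for measuring intervals.
    start = time.perf_counter()
    result = func(*args)
    run_time = time.perf_counter() - start
    return result, run_time

def spin_loop(n):
    i = 0
    while i < n:
        i = i + 1

result, seconds = time_execution(spin_loop, 1000)
```

The measured time varies from run to run and from machine to machine, so only the order of magnitude is meaningful.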

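Here eval evaluates a string as a Python expression and returns the result. It can therefore be used to convert the string form of a list, tuple, or dict back into the corresponding object: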

# string to list
>>>a = "[[1,2], [3,4], [5,6], [7,8], [9,0]]"
>>>type(a)
<type 'str'>
>>> b = eval(a)
>>> print b
[[1, 2], [3, 4], [5, 6], [7, 8], [9, 0]]
>>> type(b)
<type 'list'>


# string to dict
>>> a = "{1: 'a', 2: 'b'}"
>>> type(a)
<type 'str'>
>>> b = eval(a)
>>> print b
{1: 'a', 2: 'b'}
>>> type(b)
<type 'dict'>



# string to tuple
>>> a = "([1,2], [3,4], [5,6], [7,8], (9,0))"
>>> type(a)
<type 'str'>
>>> b = eval(a)
>>> print b
([1, 2], [3, 4], [5, 6], [7, 8], (9, 0))
>>> type(b)
<type 'tuple'>
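Because eval runs arbitrary code, it is risky on strings from untrusted sources. For the conversions shown above, the standard library's ast.literal_eval is a commonly suggested alternative that accepts only literals. A small sketch:

```python
import ast

as_list = ast.literal_eval("[[1,2], [3,4], [5,6], [7,8], [9,0]]")
as_dict = ast.literal_eval("{1: 'a', 2: 'b'}")
as_tuple = ast.literal_eval("([1,2], [3,4], (9,0))")

# Anything that is not a plain literal, such as a function call, is rejected.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```

The timing helper still needs eval, since it has to execute a call expression, but data conversion does not.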

Build a large index and measure how efficiently a keyword can be looked up:

def add_to_index(index, keyword, url):
    for entry in index:
        if entry[0] == keyword:
            if url not in entry[1]:     # entry[1] is a flat list of URL strings
                entry[1].append(url)
            return
    # not found, add new keyword to index
    index.append([keyword, [url]])

def lookup(index, keyword):
    for entry in index:
        if entry[0] == keyword:
            return entry[1]
    return None

# join the characters in list p into a single string s
def make_string(p):
    s = ""
    for e in p:
        s += e
    return s

def make_big_index(size):
    index = []
    letters = ['a','a','a','a','a','a']
    while len(index) < size:
        word = make_string(letters)
        add_to_index(index, word, 'fake')   # every keyword maps to the URL 'fake'
        # advance the letters like an odometer, starting from the last position
        for i in range(len(letters) - 1, 0, -1):
            if letters[i] < 'z':
                letters[i] = chr(ord(letters[i]) + 1)
                break
            else:
                letters[i] = 'a'
    
    return index 

# build the index, then time a lookup of its last keyword
index = make_big_index(10000)
print time_execution('lookup(index, "aaaoup")')
=>
(['fake'], 0.0017249999998512067)

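Next comes the idea of a hash table, and a comparison of good and bad hash functions:

# bad hash function: keywords are distributed unevenly across buckets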
def bad_hash(key, size):
    return ord(key[0]) % size

# a better hash function
def hash_string(keyword,buckets):
    s = 0
    for e in keyword:
        s += ord(e)
    return  s % buckets

# test helper: counts how many distinct keywords land in each bucket
def test_hash_func(func, keys, size):
    ans = [0] * size
    key_used = []
    for w in keys:
        if w not in key_used:
            hv = func(w, size)
            ans[hv] += 1
            key_used.append(w)
            
    return ans

# test
def get_page(url):
    try:
        import urllib
        return urllib.urlopen(url).read()
    except:
        return ''

words = get_page('http://www.gutenberg.org/cache/epub/1661/pg1661.txt').split()

cnt = test_hash_func(bad_hash, words, 12)
print cnt
=>
[730, 1541, 1055, 1752, 1784, 839, 1452, 2074, 1409, 754, 924, 899]

cnt = test_hash_func(hash_string, words, 12)
print cnt
=>
[1368, 1268, 1273, 1279, 1284, 1245, 1207, 1228, 1281, 1232, 1233, 1315]

cnt = test_hash_func(hash_string, words, 100)
print cnt
=>
[136, 127, 117, 137, 129, 149, 116, 126, 111, 128, 142, 131, 151, 129, 150, 124, 157, 144, 151, 150, 137, 105, 151, 144, 141, 153, 141, 185, 144, 154, 154, 163, 192, 159, 163, 190, 153, 177, 162, 175, 172, 166, 179, 164, 186, 167, 173, 144, 174, 167, 154, 164, 177, 179, 163, 171, 187, 162, 160, 181, 166, 161, 136, 154, 169, 156, 150, 147, 154, 164, 126, 173, 156, 165, 146, 151, 150, 145, 148, 152, 148, 148, 161, 140, 188, 150, 150, 121, 167, 123, 142, 132, 136, 132, 126, 141, 152, 135, 152, 162]
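The difference between the two hash functions can also be seen without downloading a book. The sketch below uses a tiny, deliberately contrived key list in which every key starts with the same letter; bucket_counts is a simplified stand-in for test_hash_func.

```python
def bad_hash(key, size):
    return ord(key[0]) % size

def hash_string(keyword, buckets):
    total = 0
    for ch in keyword:
        total += ord(ch)
    return total % buckets

def bucket_counts(func, keys, size):
    counts = [0] * size
    for key in set(keys):              # count each distinct key once
        counts[func(key, size)] += 1
    return counts

# Contrived sample: every key starts with 'a'.
keys = ['an', 'as', 'at', 'am']

bad = bucket_counts(bad_hash, keys, 4)       # [0, 4, 0, 0]
good = bucket_counts(hash_string, keys, 4)   # [1, 1, 1, 1]
```

bad_hash only looks at the first character, so all four keys collide in one bucket. hash_string uses every character, although it still maps anagrams such as 'at' and 'ta' to the same bucket.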

Creating an empty hash table:

def make_hashtable(nbuckets):
    l = []
    for i in range( nbuckets):
        l.append([])
        
    return l

print make_hashtable(3)
=>
[[], [], []]

# implementing it the following way causes a problem
def make_hashtable_not(nbuckets):
    return [[]] * nbuckets
t = make_hashtable_not(3)
t[1].append(['yes',['https://udacity.com']])
print t
=>
[[['yes', ['https://udacity.com']]], [['yes', ['https://udacity.com']]], [['yes', ['https://udacity.com']]]]

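A list created this way does not hold three separate copies of the empty list but three references to it: every element of the outer list refers to the same empty list.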
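The aliasing can be checked directly with the is operator. The sketch also shows a list comprehension as one way to get independent buckets.

```python
shared = [[]] * 3
shared[1].append('x')
# shared is now [['x'], ['x'], ['x']]
same_object = shared[0] is shared[1]        # True: one list, three references

independent = [[] for _ in range(3)]        # a fresh list on every iteration
independent[1].append('x')
# independent is now [[], ['x'], []]
distinct_objects = independent[0] is not independent[1]   # True
```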

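Next we implement lookup, update, and the other hash table operations. The table is a list of buckets, and each bucket is a list of [key, value] entries.

Define hashtable_get_bucket, which finds the bucket that a given keyword hashes to: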

def hashtable_get_bucket(htable,keyword):
    l = len(htable)
    i = hash_string(keyword,l)
    return htable[i]


def hash_string(keyword,buckets):
    out = 0
    for s in keyword:
        out = (out + ord(s)) % buckets
    return out

def make_hashtable(nbuckets):
    table = []
    for unused in range(0,nbuckets):
        table.append([])
    return table

table = [[['Francis', 13], ['Ellis', 11]], [], [['Bill', 17],
['Zoe', 14]], [['Coach', 4]], [['Louis', 29], ['Rochelle', 4], ['Nick', 2]]]

print hashtable_get_bucket(table, "Zoe")
#>>> [['Bill', 17], ['Zoe', 14]]

print hashtable_get_bucket(table, "Brick")
#>>> []

print hashtable_get_bucket(table, "Lilith")
#>>> [['Louis', 29], ['Rochelle', 4], ['Nick', 2]]

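Define hashtable_add(htable,key,value), which adds a key and its value to the hash table. Building on the previous function, it only needs to append an entry to the end of the matching bucket: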

def hashtable_add(htable,key,value):
    # your code here
    hashtable_get_bucket(htable,key).append([key,value])
    return htable  
    
    
def hashtable_get_bucket(htable,keyword):
    return htable[hash_string(keyword,len(htable))]

def hash_string(keyword,buckets):
    out = 0
    for s in keyword:
        out = (out + ord(s)) % buckets
    return out

def make_hashtable(nbuckets):
    table = []
    for unused in range(0,nbuckets):
        table.append([])
    return table

table = make_hashtable(5)
hashtable_add(table,'Bill', 17)
hashtable_add(table,'Coach', 4)
hashtable_add(table,'Ellis', 11)
hashtable_add(table,'Francis', 13)
hashtable_add(table,'Louis', 29)
hashtable_add(table,'Nick', 2)
hashtable_add(table,'Rochelle', 4)
hashtable_add(table,'Zoe', 14)
print table
#>>> [[['Ellis', 11], ['Francis', 13]], [], [['Bill', 17], ['Zoe', 14]], 
#>>> [['Coach', 4]], [['Louis', 29], ['Nick', 2], ['Rochelle', 4]]]

Define hashtable_lookup(htable,key), which returns the value stored for a key. It calls hashtable_get_bucket(htable,key) to find the bucket the key hashes to, scans that bucket for the key, and returns None if the key is not there:

def hashtable_lookup(htable,key):
    l = hashtable_get_bucket(htable,key)
    for e in l:
        if e[0] == key:
            return e[1]
    return None


def hashtable_add(htable,key,value):
    bucket = hashtable_get_bucket(htable,key)
    bucket.append([key,value])


def hashtable_get_bucket(htable,keyword):
    return htable[hash_string(keyword,len(htable))]

def hash_string(keyword,buckets):
    out = 0
    for s in keyword:
        out = (out + ord(s)) % buckets
    return out

def make_hashtable(nbuckets):
    table = []
    for unused in range(0,nbuckets):
        table.append([])
    return table


table = [[['Ellis', 11], ['Francis', 13]], [], [['Bill', 17], ['Zoe', 14]],
[['Coach', 4]], [['Louis', 29], ['Nick', 2], ['Rochelle', 4]]]

print hashtable_lookup(table, 'Francis')
#>>> 13

print hashtable_lookup(table, 'Louis')
#>>> 29

print hashtable_lookup(table, 'Zoe')
#>>> 14

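Define hashtable_update(htable,key,value), which updates the value of a given key. It calls hashtable_get_bucket(htable,key) to find the key's bucket and scans it: if the key is found its value is overwritten, otherwise a new entry is appended. Scanning the bucket directly, rather than testing the result of hashtable_lookup, also works for keys whose stored value is 0 or None:

def hashtable_update(htable,key,value):
    bucket = hashtable_get_bucket(htable,key)
    for entry in bucket:
        if entry[0] == key:
            entry[1] = value         # key already present: overwrite its value
            return htable
    bucket.append([key,value])       # key not present: add a new entry
    return htable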

def hashtable_lookup(htable,key):
    bucket = hashtable_get_bucket(htable,key)
    for entry in bucket:
        if entry[0] == key:
            return entry[1]
    return None

def hashtable_add(htable,key,value):
    bucket = hashtable_get_bucket(htable,key)
    bucket.append([key,value])


def hashtable_get_bucket(htable,keyword):
    return htable[hash_string(keyword,len(htable))]

def hash_string(keyword,buckets):
    out = 0
    for s in keyword:
        out = (out + ord(s)) % buckets
    return out

def make_hashtable(nbuckets):
    table = []
    for unused in range(0,nbuckets):
        table.append([])
    return table


table = [[['Ellis', 11], ['Francis', 13]], [], [['Bill', 17], ['Zoe', 14]],
[['Coach', 4]], [['Louis', 29], ['Nick', 2], ['Rochelle', 4]]]

hashtable_update(table, 'Bill', 42)
hashtable_update(table, 'Rochelle', 94)
hashtable_update(table, 'Zed', 68)
print table
#>>> [[['Ellis', 11], ['Francis', 13]], [['Zed', 68]], [['Bill', 42], 
#>>> ['Zoe', 14]], [['Coach', 4]], [['Louis', 29], ['Nick', 2], 
#>>> ['Rochelle', 94]]]

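Python's built-in dictionary type can serve as the hash table. First, an introduction to dictionaries and a comparison of the three types: a string is an immutable sequence of characters, a list is a mutable sequence of arbitrary elements, and a dictionary is a mutable collection of key-value pairs.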
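A small sketch of how the three types differ in practice (the sample values are made up for illustration):

```python
# String: a sequence of characters, immutable.
s = 'hello'
try:
    s[0] = 'y'
    string_is_mutable = True
except TypeError:
    string_is_mutable = False        # item assignment is not allowed

# List: a sequence of arbitrary elements, mutable, indexed by position.
p = ['h', 'e', 'l', 'l', 'o']
p[0] = 'y'

# Dictionary: key-value pairs, mutable, indexed by key.
elements = {'hydrogen': 1, 'helium': 2}
elements['lithium'] = 3              # add a new key
elements['helium'] = 'He'            # replace the value of an existing key
has_carbon = 'carbon' in elements    # membership test on keys
```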

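Next, crawl_web, add_to_index, and lookup are rewritten to use a dictionary. The only change to crawl_web is replacing index = [] with index = {}. Since crawl_web calls add_page_to_index, which in turn calls add_to_index, only add_to_index needs to be modified: instead of scanning a list of entries, it looks the keyword up in the dictionary directly and appends the URL to that keyword's list, creating the entry if the keyword is new.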
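A minimal sketch of what a dictionary-based add_to_index could look like, assuming each keyword maps to a plain list of URLs (the same shape that the dictionary version of lookup returns). The duplicate check is carried over from the list version and is an assumption of this sketch.

```python
def add_to_index(index, keyword, url):
    if keyword in index:
        if url not in index[keyword]:    # record each URL only once
            index[keyword].append(url)
    else:
        index[keyword] = [url]           # first time this keyword is seen

index = {}
add_to_index(index, 'udacity', 'http://udacity.com')
add_to_index(index, 'computing', 'http://acm.org')
add_to_index(index, 'udacity', 'http://npr.org')
# index is now {'udacity': ['http://udacity.com', 'http://npr.org'],
#               'computing': ['http://acm.org']}
```

Finding the keyword no longer requires a loop over every entry, which is where the speed-up over the list-based index comes from.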

lookup becomes:

def lookup(index, keyword):
    if keyword in index:
        return index[keyword]
    return None

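Unit 5 is mainly about modifying the earlier search engine to make it more efficient, so that it can answer queries quickly. Analyzing algorithmic complexity and designing more effective data structures is at the heart of computer science.

Problem Set

Define a cache and compare how efficiently Fibonacci numbers are computed with and without it. Without a cache: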

import time
def time_execution(code):
    start = time.clock()
    result = eval(code)
    run_time = time.clock() - start
    return result , run_time

def fibo(n):
    if n == 1 or n == 0:
        return n
    else:
        return fibo(n - 1) + fibo(n - 2)

print time_execution('fibo(40)')
# computing the 40th term takes about 52.5 s
=>
(102334155, 52.544790000000006)

With a cache:

import time
def time_execution(code):
    start = time.clock()
    result = eval(code)
    run_time = time.clock() - start
    return result , run_time

def cached_execution(cache, proc, proc_input):
    # Your code here
    if proc_input in cache:
        return cache[proc_input]
    cache[proc_input] = proc(proc_input)
    return cache[proc_input]

# Here is an example showing the desired behavior of cached_execution:

def factorial(n):
    print "Running factorial"
    result = 1
    for i in range(2, n + 1):
        result = result * i
    return result

cache = {}

def cached_fibo(n):
    if n == 1 or n == 0:
        return n
    else:
        return (cached_execution(cache, cached_fibo, n - 1 )
               + cached_execution(cache,  cached_fibo, n - 2 ))
               
 
print time_execution('cached_execution(cache, cached_fibo,100)')
# computing the 100th term takes far less than a second: memoization avoids most of the repeated computation
=>
(354224848179261915075L, 0.00022900000000447562)
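On Python 3, the same memoization idea is available in the standard library as functools.lru_cache. The sketch below is an alternative to the hand-written cached_execution, not the course's solution.

```python
from functools import lru_cache

@lru_cache(maxsize=None)     # maxsize=None: keep every result, never evict
def fibo(n):
    if n == 0 or n == 1:
        return n
    return fibo(n - 1) + fibo(n - 2)

value = fibo(100)            # 354224848179261915075
```

Each fibo(k) is computed once and then served from the cache, so the number of calls grows linearly with n instead of exponentially.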

Shifting letters:

# shift forward by one letter
def shift(letter):
    return chr( ord('a') + ( ord(letter) + 1- ord('a') )%26)


print shift('a')
#>>> b
print shift('n')
#>>> o
print shift('z')
#>>> a

# shift forward by n letters
def shift_n_letters(letter, n):
    l = ord(letter) + n
    l = (l - ord('a')) % 26
    return chr(ord('a') + l)

print shift_n_letters('s', 1)
#>>> t
print shift_n_letters('s', 2)
#>>> u
print shift_n_letters('s', 10)
#>>> c
print shift_n_letters('s', -10)
#>>> i

Shifting a whole string:

def shift(c, t):
    l = (ord(c) - ord('a') + t + 26)%26
    l = l + ord('a')
    return chr(l)
    
def rotate(s, t):
    # Your code here
    l = len(s)
    a = ''
    for i in range(l):
        if ord(s[i]) >= ord('a') and ord(s[i]) <= ord('z'):
            a += shift(s[i],t)
        else:
            a += s[i]
    return a
        

print rotate ('sarah', 13)
#>>> 'fnenu'
print rotate('fnenu',13)
#>>> 'sarah'
print rotate('dave',5)
#>>>'ifaj'
print rotate('ifaj',-5)
#>>>'dave'
print rotate(("zw pfli tfuv nfibj tfiivtkcp pfl jyflcu "
                "sv rscv kf ivru kyzj"),-17)
#>>> if your code works correctly you should be able to read this
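As a cross-check, the rotation can also be written with a translation table. This sketch assumes Python 3, where str.maketrans is available; rotate_table is a name chosen here to avoid clashing with rotate.

```python
import string

def rotate_table(s, t):
    letters = string.ascii_lowercase
    k = t % 26                                   # also handles negative shifts
    # Map each lowercase letter to the letter k positions later, wrapping around.
    table = str.maketrans(letters, letters[k:] + letters[:k])
    # Characters without an entry in the table (spaces, digits) pass through unchanged.
    return s.translate(table)

encoded = rotate_table('sarah', 13)              # 'fnenu'
decoded = rotate_table(encoded, 13)              # 'sarah'
```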
