Python内容分享活跃站点统计与一些文章链接汇总

根据PythonWeekly每期推荐的文章或教程的引用来源做简单统计，推测国外python内容分享活跃站点，并简单比较优质python博客或内容发布地的变化。
旧为2012-2013年左右，新为今年（2017）。格式： [(站点根域名，累计次数)，...]。
（新旧统计期数不同，新：最近20期；旧：最初的81期。只统计文章和教程的链接。）

可以看出Github和Youtube一直是活跃的分享站点。wordpress的份额变少了，blogspot基本看不到了，而medium成为风头正劲。变化还是很大的。

PythonWeekly的内容相当不错，可是有时太懒没看，错过不少期，又懒得在邮件列表中一个一个翻，于是做了个爬虫，只爬取其中的推荐文章/教程部分，生成一个可以快速浏览的markdown列表，便于筛选感兴趣的文章，在浏览器中打开也很方便搜索。不过，看英文标题的扫视速度还是太慢，申请了个百度翻译的api机翻了标题，虽然不如谷歌翻译，也勉强能看，毕竟看中文还是舒服些。点击下方的链接就可以查看汇总文章了。（简书放不下就放到了github上了，显示效果都差不多）

近期文章汇总

初期文章汇总

目前能找到的只有初始81期（对应initNews.py）和最近的20期（对应recentNews.py）（PythonWeekly 一周一期）。使用代码需要替换入你自己的百度翻译api秘钥。
initNews.py和recentNews.py基本差不多。不过，后者只使用了单线程，比较慢，但是量少，也没花多少时间。前者使用了多线程，速度提升很明显，虽然是4倍的量，但差不多一下子就完了。（此外，百度翻译api中的query使用'\n'分隔可一次翻译多个语句）

代码

initNews.py

import requests
from bs4 import BeautifulSoup
import re
# 请替换为你的秘钥
appid = 'yourappid'
secretKey = 'yoursecretkey'

from fake_useragent import UserAgent
ua = UserAgent()
headers = {'user-agent': ua.chrome}

pythonweekly_init_issues_archive_url = (
    'http://www.pythonweekly.com/archive/')

def get_pythonweekly_init_issues_urls():
    url = pythonweekly_init_issues_archive_url
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.content, 'lxml')
    return [[
                a.text.split(' ')[-1].strip(),
                ''.join([url, a['href']]),
            ] for a in soup.select('li a')]

pythonweekly_init_issues_urls = get_pythonweekly_init_issues_urls()

def get_single_issue_info(issue):
    try:
        # issue = [text, url, list] 
        url = issue[1]
        res = requests.get(url, headers=headers)
        soup = BeautifulSoup(res.content, 'lxml')
        content = soup.select_one('td .defaultText')

        try:
            submenus = [i.text for i in content.find_all('strong')]
            for index, menu in enumerate(submenus):
                if re.search('[Aa]rticles', menu):
                    break
            start_text = [menu,]
            end_text = submenus[index+1]
        except:
            # 脏改 
            start_text = ['Articles,\xa0Tutorials and Talks',
                          '\xa0Tutorials and Talks', # 应对11,12.html
                          'Articles Tutorials and Talks']
            end_text = 'Interesting Projects, Tools and Libraries'

        flag = 0
        list_ = []
        for s in content.find_all('span'):
            if not flag:
                if s.text not in start_text:
                    continue
                else:
                    flag = 1
                    continue
            if s.text == end_text:
                break
            try:
                one = [s.text.strip(), s.find('a')['href']]
                # print(one)
                list_.append(one)
            except TypeError:
                pass
        # return list_
        issue.append(list_)
        print('下载完成', issue[0])
    except Exception as e:
        print('wrong: ', issue[0], '\n', e)

from multiprocessing.dummy import Pool
pool = Pool(30)
pool.map(get_single_issue_info, pythonweekly_init_issues_urls)
pythonweekly_init_issues = pythonweekly_init_issues_urls


def baidu_translates(query):
    '''
    http://api.fanyi.baidu.com/api/trans/product/apidoc
    '''
    from hashlib import md5
    import random
    
    url = 'http://api.fanyi.baidu.com/api/trans/vip/translate'
    fromLang = 'en'
    toLang = 'zh'
    salt = random.randint(32768, 65536)

    sign = appid + query + str(salt) + secretKey
    m1 = md5()
    m1.update(sign.encode('utf-8'))
    sign = m1.hexdigest()
    
    params = {'appid':appid,
              'q':query,
              'from':fromLang,
              'to':toLang,
              'salt':str(salt),
              'sign':sign,}
    res = requests.get(url, params=params)
    return res.json()['trans_result']

def get_translate(issue):
    articles = issue[-1]
    try:
        result = baidu_translates('\n'.join([i[0] for i in articles]))
        for index, i in enumerate(articles):
            i.append(result[index]['dst'])
        print('翻译完成', issue[0])
    except:
        print('**翻译失败**', issue[0])

pool.map(get_translate, pythonweekly_init_issues)


from jinja2 import Template
table = """
<table>
    {% for issue_num, issue_href, article_lists in issues %}
        {% for article_name, article_href, article_chinese in article_lists %}
        <tr>
            <td><a href='{{issue_href}}'>{{ issue_num }}</a></td>
            <td><a href='{{article_href}}'>{{ article_name }}</a></td>
            <td><a href='{{article_href}}'>{{ article_chinese }}</a></td>
        </tr>
        {% endfor %}
    {% endfor %}
</table>
"""

template = Template(table)
t = template.render(issues=pythonweekly_init_issues)

import time
with open('pythonweekly_init ' + time.ctime().replace(':', '_') + '.html', 'w', encoding='utf-8') as f:
    f.write(t)
    

pool.close()
pool.join()

# https://stackoverflow.com/questions/9626535/get-domain-name-from-url
# get_host = requests.urllib3.util.url.get_host # get_host(i[1])[1]
import tldextract
host_list = [
    tldextract.extract(i[1]).domain
    for *_, articles in pythonweekly_init_issues for i in articles ]

from collections import Counter
counter = Counter(host_list)
print(counter.most_common(20))

with open('pythonweekly_init.md', 'w', encoding='utf-8') as f:
    f.write(u'### PythonWeekly初期文章教程汇总\n')
    f.write(u'|      期号     |     英文名    | 中文名|\n')
    f.write(u'| ------------- |:-------------:| -----:|\n')
    for issue_num, issue_href, article_lists in pythonweekly_init_issues:
        for article_name, article_href, article_chinese in article_lists:
            f.write(('| [{issue_num}]({issue_href}) '
                     '| [{article_name}]({article_href}) '
                     '| [{article_chinese}]({article_href}) '
                     '| \n').format(**locals()))

recentNews.py

import requests
from bs4 import BeautifulSoup
import re
# 请替换为你的秘钥
appid = 'yourappid'
secretKey = 'yoursecretkey'


from fake_useragent import UserAgent
ua = UserAgent()
headers = {'user-agent': ua.chrome}

pythonweekly_recent_issues_archive_url = (
    'http://us2.campaign-archive2.com/home/'
    '?u=e2e180baf855ac797ef407fc7&id=9e26887fc5')

def get_pythonweekly_recent_issues_urls():
    
    res = requests.get(pythonweekly_recent_issues_archive_url, headers=headers)
    soup = BeautifulSoup(res.content, 'lxml')
    return [[
                a.text.split(' ')[-1].strip(),
                a['href'],
            ]
            for a in soup.select('li a')]

pythonweekly_recent_issues_urls = get_pythonweekly_recent_issues_urls()

def get_single_issue_info(url):
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.content, 'lxml')
    content = soup.select_one('td .defaultText')

    submenus = [i.text for i in content.find_all('span', attrs={'style':"color:#B22222"})]
    for index, i in enumerate(submenus):
        if re.search('[Aa]rticles', i):
            break
    start_text = i
    end_text = submenus[index+1]

    flag = 0
    list_ = []
    for s in content.find_all('span'):
        if not flag:
            if s.text != start_text:
                continue
            else:
                flag = 1
                continue
        if s.text == end_text:
            break
        try:
            one = [s.text.strip(), s.find('a')['href']]
            # print(one)
            list_.append(one)
        except TypeError:
            pass
    return list_

for i in pythonweekly_recent_issues_urls:
    # [text, url, list] 
    print(i[0])
    i.append(get_single_issue_info(i[1]))

pythonweekly_recent_issues = pythonweekly_recent_issues_urls


def baidu_translate(query):
    '''
    http://api.fanyi.baidu.com/api/trans/product/apidoc
    '''
    from hashlib import md5
    import random

    url = 'http://api.fanyi.baidu.com/api/trans/vip/translate'
    fromLang = 'en'
    toLang = 'zh'
    salt = random.randint(32768, 65536)

    sign = appid + query + str(salt) + secretKey
    m1 = md5()
    m1.update(sign.encode('utf-8'))
    sign = m1.hexdigest()
    
    params = {'appid':appid,
              'q':query,
              'from':fromLang,
              'to':toLang,
              'salt':str(salt),
              'sign':sign,}
    res = requests.get(url, params=params)
    return res.json()['trans_result'][0]['dst']

for *_, articles in pythonweekly_recent_issues:
    for i in articles:
        i.append(baidu_translate(i[0]))
    print('done')


from jinja2 import Template
table = """
<table>
    {% for issue_num, issue_href, article_lists in issues %}
        {% for article_name, article_href, article_chinese in article_lists %}
        <tr>
            <td><a href='{{issue_href}}'>{{ issue_num }}</a></td>
            <td><a href='{{article_href}}'>{{ article_name }}</a></td>
            <td><a href='{{article_href}}'>{{ article_chinese }}</a></td>
        </tr>
        {% endfor %}
    {% endfor %}
</table>
"""

template = Template(table)
t = template.render(issues=pythonweekly_recent_issues)

import time
with open('pythonweekly_recent ' + time.ctime().replace(':', '_') + '.html', 'w', encoding='utf-8') as f:
    f.write(t)

import tldextract
host_list = [
    tldextract.extract(i[1]).domain
    for *_, articles in pythonweekly_recent_issues for i in articles ]

from collections import Counter
counter = Counter(host_list)
counter.most_common(20)


with open('pythonweekly_recent.md', 'w', encoding='utf-8') as f:
    f.write(u'### PythonWeekly文章教程近期汇总\n')
    f.write(u'|      期号     |     英文名    | 中文名|\n')
    f.write(u'| ------------- |:-------------:| -----:|\n')
    for issue_num, issue_href, article_lists in pythonweekly_recent_issues:
        for article_name, article_href, article_chinese in article_lists:
            f.write(('| [{issue_num}]({issue_href}) '
                     '| [{article_name}]({article_href}) '
                     '| [{article_chinese}]({article_href}) '
                     '| \n').format(**locals()))

参考

Python Weekly statistics (using urllib2, HTMLParser and pickle)

最后编辑于：2017.12.09 12:17:36

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 158,233评论 4赞 360
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 67,013评论 1赞 291
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 108,030评论 0赞 241
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 43,827评论 0赞 204
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 52,221评论 3赞 286
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 40,542评论 1赞 216
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 31,814评论 2赞 312
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 30,513评论 0赞 198
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 34,225评论 1赞 241
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 30,497评论 2赞 244
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 31,998评论 1赞 258
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 28,342评论 2赞 253
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 32,986评论 3赞 235
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 26,055评论 0赞 8
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 26,812评论 0赞 194
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 35,560评论 2赞 271
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 35,461评论 2赞 266

Python内容分享活跃站点统计与一些文章链接汇总

近期文章汇总

初期文章汇总

代码

参考

推荐阅读更多精彩内容