简书入口用户简单分析(二)——用户动态

字数 659 · 阅读 809

在本人主观定义的入口用户(见简书入口用户简单分析)中选取了8143名用户,爬取他们的时间线,获取全部动态进行可视化分析

  • 所有动态的年月变化趋势,喜欢文章、发表评论和赞赏文章的趋势基本一致,可以看出简书身为创作者社区,发文与阅读的比例还是很高的:



    赞赏文章的行为在今年三月份到达高峰,16年中下半年是用户入驻简书的高峰:



    纵轴对数分布:
  • 所有动态的一天24小时分布,一天的活动从早上6-8点开始,晚上9-11点有明显的阅读活动和发表文章的小高峰:


  • 所有动态的星期分布,差异不大:



    纵轴放大:



    换一种看法,动态在一周内的占比分布,星期六有个较明显的下降趋势,但还是区别不大:
  • 所有动态以月为周期按日取和统计,月末有个小下降,但31号的异常是因为某些月份没有31号:



    按动态的当日占比来看,相当均衡:


  • 几个最活跃用户(动态总数最多)的行为时间分布,作者梅话三弄和云儿飘过也分别是给别的作者打赏次数第一、第二多(1300多次打赏)的用户:





  • 给别人打赏第三多的用户



    用户打赏次数分布,差异还是相当大的,但是有过打赏行为的用户占了34.4%,还是很高的比例了。


  • 所有用户总活跃天数(一天内有动态该天即为活跃)分布,分布并不均衡,是不是老用户的活跃天数就多呢?



    将用户按入驻年月分组,对他们的活跃天数取平均,可以看出16年左右入驻的用户的活跃天数最多,并非是越老的用户活跃天数就越多,老用户也可能渐渐失去对平台的兴趣而流失。



    活跃比率(活跃天数除以入驻总天数)的分布,大部分用户小于一周一次活跃:

    用户的简书入驻总天数(横轴)与活跃比率(纵轴)的相关性分析,老用户的活跃度要低些,新用户正是热情高涨。


代码

# ---- Continuation of the code from "Jianshu entry users: a simple analysis" ---- #
# Fetch user timeline activities. Event types observed in the timeline data:
# share_note       published an article
# like_comment     liked a comment
# like_note        liked an article
# comment_note     posted a comment
# like_collection  followed a collection (topic)
# reward_note      rewarded (tipped) an article
# like_user        followed an author
# join_jianshu     joined Jianshu (the sign-up event)
# like_notebook    liked a notebook

# Database setup: local MongoDB, database `jianshu`.
from pymongo import MongoClient
client = MongoClient('localhost', 27017)
db = client.jianshu

# Helper decorator.
from functools import wraps
def store_and_record_error(errors, coll):
    """Decorator factory: persist a scraper's result and record failures.

    The wrapped function must return a list of documents. On success they
    are bulk-inserted into MongoDB collection ``coll`` and None is returned
    (returning the documents would make the multiprocessing result transfer
    needlessly large). On any exception the first positional argument
    (the user id) is appended to ``errors`` and the error is printed.
    """
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kwargs):
            uid = args[0] if args else kwargs.get('id')
            try:
                res = f(*args, **kwargs)
                # insert_many raises on an empty document list, so a user
                # with zero events must not be flagged as an error.
                if res:
                    coll.insert_many(res)
                return None
            except Exception as e:  # was a bare `except:`, which hid the cause
                errors.append(uid)
                print('error', uid, e)
                return None
        return wrapper
    return decorator

import re
errors = []  # user ids whose timeline scrape failed
coll = db.user_active_new
@store_and_record_error(errors, coll)
@retry(Exception, delay=1, backoff=2, tries=2)  # third-party `retry` decorator
def get_active_from_single_user(id='45a15c9b5a22'):
    """Scrape one user's entire timeline, page by page.

    Returns a list of dicts {'id', 'action_time', 'action_type'} — one per
    timeline event. ``action_time`` keeps the local part of the ISO stamp
    (the '+HH:MM' timezone offset is stripped). Relies on module-level
    ``host``, ``headers`` and ``requests``/``BeautifulSoup`` from part one.
    """
    first_url = host + '/users/{id}/timeline?_pjax=%23list-container'.format(id=id)
    res = requests.get(first_url, headers=headers)

    infos = []  # all activity records for this user
    # Raw string: was '<li id="feed-(\d+)">', an invalid-escape warning in py3.
    match = re.compile(r'<li id="feed-(\d+)">')
    num = 2  # next page number to request

    while True:
        soup = BeautifulSoup(res.text, 'lxml')
        info = [{'id': id,  # several events per page  # todo: memory footprint
                 'action_time': i['data-datetime'].split('+')[0],
                 'action_type': i['data-type'],}
                for i in soup.select('.content .author .name span')]
        if not info:
            print('over', id)
            break
        infos.extend(info)
        # Pagination cursor: one less than the last feed id on this page.
        max_id = int(match.findall(res.text)[-1]) - 1
        next_url = host + '/users/{id}/timeline?page={page_num}&max_id={max_id}'.format(
            id=id, max_id=max_id, page_num=num)
        res = requests.get(next_url, headers=headers)
        num += 1

    return infos

# Scraping ~4000 users took about 75 minutes.
# pool = Pool(30)  # multiprocessing pool created in part one
# imap variants not needed for now; progress-tracking references:
# https://stackoverflow.com/questions/28375508/python-multiprocessing-tracking-the-process-of-pool-map-operation
# https://stackoverflow.com/questions/34827250/how-to-keep-track-of-status-with-multiprocessing-and-pool-map
# https://stackoverflow.com/questions/26520781/multiprocessing-pool-whats-the-difference-between-map-async-and-imap
# Requires the `users` DataFrame of entry users scraped previously.
all_active = pool.map(get_active_from_single_user, users['id'].values)
# Mongo shell command to list scraped ids: db.getCollection('user_active_new').distinct('id')

import pandas as pd
def read_data_from_mongo(coll):
    """Load every activity document from ``coll`` into a DataFrame.

    Mongo's ``_id`` field is excluded, ``action_time`` is parsed into
    datetimes, and the two low-cardinality string columns are stored as
    pandas categoricals to shrink the in-memory footprint.
    """
    documents = list(coll.find({}, {'_id': False}))
    frame = pd.DataFrame(documents)
    frame['action_time'] = pd.to_datetime(frame['action_time'])
    # Categorical dtype: raw integer codes are available via .cat.codes.
    for column in ('action_type', 'id'):
        frame[column] = frame[column].astype('category')
    frame.info(memory_usage='deep')  # report actual (deep) memory usage
    return frame


coll = db.user_active_new
all_active = read_data_from_mongo(coll)

# ----------------------- Memory-usage inspection ( --------------------------- #
import sys  # other introspection options: dir() / globals() / locals() / vars() / %whos
# NOTE: sys.getsizeof is shallow (does not follow references); the DataFrame's
# true footprint is reported by .info(memory_usage='deep') inside the loader.
for var, obj in locals().items():
    print(var, sys.getsizeof(obj))
print('内存占用%sM' % (sys.getsizeof(all_active)/1048576))

import os, psutil
process = psutil.Process(os.getpid())
process.memory_info()[0] / float(2 ** 20)  # resident set size in MB (bare expression: notebook/REPL display)
process.memory_percent()  # this process's share of system memory (REPL display)
# ----------------------- Memory-usage inspection ) --------------------------- #


# Data analysis
# Total activity count per user, sorted descending.
id_active_times = all_active.groupby('id').size().to_frame(name='active_times').reset_index()
id_active_times.sort_values('active_times', inplace=True, ascending=False)
id_active_times.active_times.max()  # bare expression: REPL display of the maximum

# Day-of-week distribution (isoweekday: 1 = Monday ... 7 = Sunday).
weekday = all_active.action_time.apply(lambda x: x.isoweekday())
all_active['weekday'] = weekday
weekday = all_active.groupby('weekday').size()  # NOTE: rebinds `weekday` to per-day counts
weekday.plot.bar()
plt.show()

# Hour-of-day distribution (0-23).
hours = all_active.action_time.apply(lambda x: x.hour)
all_active['hours'] = hours
hours = all_active.groupby('hours').size()  # rebinds `hours` to per-hour counts
hours.plot.bar()
plt.show()

# Append one 0/1 indicator column per action type; the grouped sums/means
# of these columns drive all the charts below.
all_active = pd.concat([
    all_active,  
    pd.get_dummies(all_active['action_type'])], axis=1)
    
# all_active.groupby('weekday')['share_note'].sum()
# One bar chart per action type, counts grouped by weekday
# (the Chinese strings are the on-chart axis titles).
data = all_active.groupby('weekday')
for d, t in zip(
    ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
     'like_user', 'join_jianshu', 'like_notebook'],
    ['发表文章', '赞了评论', '喜欢了文章', '发表评论', '关注了专题', '赞赏文章',
    '关注作者', '加入简书','喜欢专辑',]):
    temp = data[d].sum()  # sum of the 0/1 indicator = event count per weekday
    fig = {
        'data': [go.Bar(x=temp.index.values,
                        y=temp.values,)],
        'layout': {'yaxis': {'title': t}},
    }
    plotly.offline.plot(fig, filename='basic_bar_%s.html'%d, show_link=False)

# All action types together in a single grouped bar chart.
# NOTE(review): 'basic_bar.html' is reused by several later plots; each call
# overwrites the previous output file — confirm this is intended.
x = []
for d, t in zip(
    ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
     'like_user', 'join_jianshu', 'like_notebook'],
    ['发表文章', '赞了评论', '喜欢了文章', '发表评论', '关注了专题', '赞赏文章',
    '关注作者', '加入简书','喜欢专辑',]):
    temp = data[d].sum()
    x.append(go.Bar(x=temp.index.values,
                    y=temp.values,
                    name=t))
plotly.offline.plot(x, filename='basic_bar.html', show_link=False)

# Line chart: each action type's share across the days of the week.
x = []
for d, t in zip(
    ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
     'like_user', 'join_jianshu', 'like_notebook'],
    ['发表文章', '赞了评论', '喜欢了文章', '发表评论', '关注了专题', '赞赏文章',
    '关注作者', '加入简书','喜欢专辑',]):
    temp = data[d].sum()
    # Normalize to in-week proportions rather than raw counts.
    temp = temp / temp.sum()
    x.append(go.Scatter(x=temp.index.values,
                        y=temp.values,
                        mode = 'lines',
                        name = t))
plotly.offline.plot(x, filename='basic_bar.html', show_link=False)

# Each action type aggregated by hour of day (0-23).
data = all_active.groupby('hours')
x = []
for d, t in zip(
    ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
     'like_user', 'join_jianshu', 'like_notebook'],
    ['发表文章', '赞了评论', '喜欢了文章', '发表评论', '关注了专题', '赞赏文章',
    '关注作者', '加入简书','喜欢专辑',]):
    temp = data[d].sum()  # raw hourly counts
    # temp = temp / temp.sum()  # (alternative: hourly proportions)
    x.append(go.Scatter(x=temp.index.values,
                        y=temp.values,
                        mode = 'lines',
                        name = t))
plotly.offline.plot(x, filename='basic_bar.html', show_link=False)

# Each action type aggregated by day of month (1-31).
all_active['day'] = all_active.action_time.apply(lambda x: x.day)
data = all_active.groupby('day')
x = []
for d, t in zip(
    ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
     'like_user', 'join_jianshu', 'like_notebook'],
    ['发表文章', '赞了评论', '喜欢了文章', '发表评论', '关注了专题', '赞赏文章',
    '关注作者', '加入简书','喜欢专辑',]):
    temp = data[d].mean()  # mean of the 0/1 indicator = this type's share of that day's events
    # temp = temp / temp.sum()
    x.append(go.Scatter(x=temp.index.values,
                        y=temp.values,
                        mode = 'lines+markers',
                        name = t))
plotly.offline.plot(x, filename='basic_bar.html', show_link=False)


# Monthly trend of 'join_jianshu' events (sign-up dates).
# .copy() makes an explicit copy: assigning a new column to a boolean-mask
# slice otherwise raises SettingWithCopyWarning and, under pandas
# copy-on-write (2.x default behavior), would silently modify nothing.
join_jianshu = all_active[all_active['action_type'] == 'join_jianshu'].copy()
join_jianshu['year_month'] = join_jianshu['action_time'].dt.strftime('%Y-%m')
data = join_jianshu.groupby('year_month').size()  # sign-ups per year-month

plotly.offline.plot([go.Scatter(
    x = data.index.values,
    y = data.values,
    mode = 'lines+markers',
    name = 'lines'
)], filename='lineZ.html', show_link=False)

# All behaviors aggregated by year-month.
all_active['year_month'] = all_active['action_time'].dt.strftime('%Y-%m')
# bug: categories is not json serializable:
# all_active['year_month'] = all_active['year_month'].astype('category')
data = all_active.groupby('year_month')
x = []
for d, t in zip(
    ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
     'like_user', 'join_jianshu', 'like_notebook'],
    ['发表文章', '赞了评论', '喜欢了文章', '发表评论', '关注了专题', '赞赏文章',
    '关注作者', '加入简书','喜欢专辑',]):
    temp = data[d].sum()  # monthly event counts per action type
    x.append(go.Scatter(x=temp.index.values,
                        y=temp.values,
                        mode = 'lines+markers',
                        name = t))
fig = {'data': x,
       # optionally switch the y axis to log scale:
       # 'layout': {'xaxis': {'title': '年月'}, 'yaxis': {'type': 'log'}}
}
plotly.offline.plot(fig, filename='basic_bar.html', show_link=False)


# TODO: share of published articles in the most recent month

# Top-5 most active users (by total event count): per-user trends by year-month.
most_active_user = all_active.groupby('id').size().sort_values(ascending=False).index[:5]
for mu in most_active_user:
    data = all_active[all_active['id'] == mu].groupby('year_month')
    x = []
    for d, t in zip(
        ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
         'like_user', 'join_jianshu', 'like_notebook'],
        ['发表文章', '赞了评论', '喜欢了文章', '发表评论', '关注了专题', '赞赏文章',
        '关注作者', '加入简书','喜欢专辑',]):
        temp = data[d].sum() # .mean() # note: mean would give per-month shares instead of counts
        x.append(go.Scatter(x=temp.index.values,
                            y=temp.values,
                            mode='lines+markers',
                            name=t))
    fig = {'data': x,
           # optionally switch the y axis to log scale:
           # 'layout': {'xaxis': {'title': '年月'}, 'yaxis': {'type': 'log'}}
    }
    plotly.offline.plot(fig, filename='most_active_user_%s.html'%mu, show_link=False)

# Active-days analysis: a user counts as "active" on a day if they have at
# least one timeline event that day.
all_active['just_date'] = all_active.action_time.dt.date
# Distinct active days per user (idiomatic .nunique() instead of apply+lambda).
active_days = all_active.groupby('id')['just_date'].nunique()
active_days.sort_values(ascending=False, inplace=True)
# Distribution of total active days across users.
plotly.offline.plot([go.Bar(y=active_days)], filename='active_days.html', show_link=False)
# Days since sign-up, compared against the number of active days.
now = pd.Timestamp.now()  # pd.datetime was deprecated and removed in pandas 2.0
join_during_time = all_active[
    all_active['action_type'] == 'join_jianshu'
][['id', 'action_time']].copy()  # explicit copy: columns are added below
join_during_time['during_time'] = now - join_during_time['action_time']
join_during_time['during_time'] = join_during_time.during_time.dt.days + 1
join_during_time = pd.merge(join_during_time,
    active_days.reset_index(name='active_days'))
join_during_time['ratio'] = join_during_time['active_days'] / join_during_time['during_time']
# Pie chart of activity-ratio buckets.
labels = ['小于10%', '大于10%小于50%', '大于50%小于90%', '大于90%']
ratio = join_during_time['ratio']
# Bug fix: the original built cumulative ">threshold" counts and inserted the
# remainder at the wrong position, so the first two slices were mislabeled
# ([a, len-a-b, ...] instead of [len-a, a-b, ...]). Compute each bucket
# explicitly — one count per label, summing to len(ratio).
values = [(ratio <= 0.1).sum(),
          ((ratio > 0.1) & (ratio <= 0.5)).sum(),
          ((ratio > 0.5) & (ratio <= 0.9)).sum(),
          (ratio > 0.9).sum()]
trace = go.Pie(labels=labels, values=values)
# NOTE(review): reuses 'active_days.html', overwriting the bar chart above.
plotly.offline.plot([trace], filename='active_days.html', show_link=False)
# Histogram of the ratio, and sign-up-age vs. ratio regression plot.
join_during_time['ratio'].hist()
sns.jointplot(data=join_during_time, x='during_time', y='ratio', kind='reg', color='g')
plt.show()  # sns.plt was removed in seaborn 0.9; call matplotlib's plt directly

# Average number of active days, grouped by sign-up year-month.
join_during_time['year_month'] = join_during_time['action_time'].dt.strftime('%Y-%m')
temp = join_during_time.groupby('year_month')['active_days'].mean()
plotly.offline.plot([go.Scatter(x=temp.index.values,
                               y=temp.values,
                               mode = 'lines+markers')],
                    filename='average_active_days.html',
                    show_link=False)

# Behavior of one of the top tippers over time.
# index[2] picks the THIRD-highest reward giver — presumably because the top
# two already appear in the "most active users" charts above (per the article).
most_reward_user = all_active.groupby('id')['reward_note'].sum().sort_values(ascending=False).index[2]
data = all_active[all_active['id'] == most_reward_user].groupby('year_month')
x = []
for d, t in zip(
    ['share_note', 'like_comment', 'like_note', 'comment_note', 'like_collection', 'reward_note',
     'like_user', 'join_jianshu', 'like_notebook'],
    ['发表文章', '赞了评论', '喜欢了文章', '发表评论', '关注了专题', '赞赏文章',
    '关注作者', '加入简书','喜欢专辑',]):
    temp = data[d].sum()
    x.append(go.Scatter(x=temp.index.values,
                        y=temp.values,
                        mode = 'lines+markers',
                        name = t))
fig = {'data': x,
       # optionally switch the y axis to log scale:
       # 'layout': {'xaxis': {'title': '年月'}, 'yaxis': {'type': 'log'}}
}
plotly.offline.plot(fig, filename='most_reward_user_%s.html'%most_reward_user, show_link=False)

# Share of users who have ever tipped (reward_note indicator summed per user).
reward = all_active.groupby('id')['reward_note'].sum().sort_values(ascending=False)
print((reward > 0).sum() / len(reward))  # fraction of users with >=1 reward event
plotly.offline.plot([go.Bar(y=reward)], filename='reward.html', show_link=False)

其他

  • 热门文章爬取(前两页)
def get_hot_articles_from_single_user(id='45a15c9b5a22'):
    """Scrape the first two pages of a user's "hottest" articles.

    Returns a list of dicts with keys: title, time (publish timestamp),
    read (view count), comments, like, and money (reward count; 0 when the
    page shows no reward span). Relies on module-level ``host``, ``headers``,
    ``requests`` and ``BeautifulSoup`` from part one.
    """
    hot_first_url = host + '/u/{id}?order_by=top&_pjax=%23list-container'.format(id=id)
    hot_next_url = host + '/u/{id}?order_by=top&page={page_num}'.format(id=id, page_num=2)
    user_hot_articles = []
    for u in [hot_first_url, hot_next_url]:
        res = requests.get(u, headers=headers)
        soup = BeautifulSoup(res.text, 'lxml')
        for info in soup.select('.content'):
            title = info.select_one('.title').text
            time_ = info.select_one('.time')['data-shared-at']
            details = info.select_one('.meta')
            # The first two <a> tags hold the read and comment counts.
            read, comments = [i.text.strip() for i in details.findAll('a')]
            # The <span> tags hold the like count and, if any, the reward count.
            like_money = [i.text.strip() for i in details.findAll('span')]
            if len(like_money) == 1:
                like = like_money[0]
                money = 0
            else:
                like, money = like_money
            user_hot_articles.append({
                'title': title,
                'time': time_,
                'read': read,
                'comments': comments,
                'like': like,
                'money': money,
            })
    # Bug fix: the original built this list but never returned it,
    # so the function always yielded None.
    return user_hot_articles

推荐阅读更多精彩内容