Machine Learning Notes - 22. Topic Model LDA in Practice (Lecturer: 邹博)

Pre-class Q&A

Q: In the code for section 22.6, why is TF-IDF computed before the data is fed to LDA?
A: This is explained later; LDA also works without it.

Main Content

2019-02-12 10_33_11-机器学习第七期升级版.png

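When running LDA, TF-IDF can be applied to the data first as a transformation.
When feeding data to LDA, we treat LDA as a bag-of-words model: every word is independent, and we simply count how many times each word occurs. That count is effectively the strength of each word.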
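
The bag-of-words counts and the optional TF-IDF re-weighting described above can be sketched in plain Python. This is only an illustration: the toy documents are made up, and the weighting `tf * log(N / df)` is one common TF-IDF variant, not necessarily the exact formula Gensim's `TfidfModel` uses.

```python
import math
from collections import Counter

# Toy, already-tokenized documents (hypothetical data for illustration).
docs = [
    ["the", "topic", "model", "topic"],
    ["the", "model", "bayes"],
    ["the", "topic", "bayes", "bayes", "bayes"],
]


def bag_of_words(doc):
    """Count how often each word occurs; word order is ignored."""
    return Counter(doc)


def tfidf(docs):
    """Re-weight raw counts as tf * log(N / df), one common TF-IDF variant."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    weighted = []
    for doc in docs:
        counts = bag_of_words(doc)
        weighted.append({w: c * math.log(n_docs / df[w]) for w, c in counts.items()})
    return weighted


bows = [bag_of_words(d) for d in docs]
weights = tfidf(docs)
print(bows[0])
print(weights[2])
```

A word that appears in every document (here "the") gets weight 0, which is the main effect of the transformation: frequent but uninformative words are pushed down before the vectors reach LDA.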

The lecture also mentioned using a web crawler to collect data as the data source.

Naive Bayes

2019-02-12 11_29_11-机器学习第七期升级版.png

2019-02-12 14_41_23-机器学习第七期升级版.png

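Take the iris data as an example. It has four features (sepal length, sepal width, petal length, petal width), and these four features are used to predict the class y, which takes one of the discrete values 0, 1, 2.
Applying Bayes' rule gives:
P(y | sepal length, sepal width, petal length, petal width) = P(y) P(sepal length, sepal width, petal length, petal width | y) / P(sepal length, sepal width, petal length, petal width)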

2019-02-12 14_49_46-机器学习第七期升级版.png

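Meaning of P(y): the probability that y takes each of the values 0, 1, 2.
For example, suppose a set of 150 iris samples contains:
100 samples of class 0
30 samples of class 1
20 samples of class 2
If one sample is drawn at random from these 150, the probability that it is class 0 is 2/3, class 1 is 1/5, and class 2 is 2/15.
That is, P(y) is the prior probability.
Meaning of P(x_i|y): suppose y is "female" and x_i is height, and assume height follows a Gaussian distribution with a mean of 1.62 m for women. If x_i is 1.86 m, then P(x_i|y) is relatively low.
In general, P(y) can be obtained by simple counting,
whereas P(x_i|y) has to be modeled, and we must decide which distribution to use (Gaussian, multinomial, Bernoulli, ...).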
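
A minimal sketch of the two quantities just described: the prior P(y) obtained by counting, and a Gaussian model for P(x_i|y). The class counts and the mean height of 1.62 m come from the notes; the standard deviation of 0.06 m is an assumed value chosen only for illustration.

```python
import math
from fractions import Fraction

# Class counts from the example above: 100 / 30 / 20 samples out of 150.
counts = {0: 100, 1: 30, 2: 20}
total = sum(counts.values())
prior = {label: Fraction(n, total) for label, n in counts.items()}


def gaussian_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


MEAN_HEIGHT = 1.62  # mean height from the notes, in meters
STD_HEIGHT = 0.06   # assumed standard deviation, not given in the notes

p_typical = gaussian_pdf(1.62, MEAN_HEIGHT, STD_HEIGHT)
p_tall = gaussian_pdf(1.86, MEAN_HEIGHT, STD_HEIGHT)

print(prior)
print(p_typical, p_tall)
```

The density at 1.86 m is far smaller than at the mean, which is what "P(x_i|y) is relatively low" means in practice.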

2019-02-12 15_03_25-机器学习第七期升级版.png

2019-02-12 15_04_38-机器学习第七期升级版.png

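For the multinomial distribution, if x can take n distinct values (for text, n is the vocabulary size) and y has K classes, there are K x n parameters.
For example, P(x=3|y=0) is (the number of times x=3 occurs when y=0) / (the total number of times y=0 occurs). This value is a frequency, i.e. the frequency is used to estimate the probability.
Adding α to the numerator and α x n to the denominator (so that the probabilities still sum to 1) gives the smoothed estimate. For documents, the smoothing serves two purposes: first, it guards against overfitting; second, it avoids a zero probability (or a zero denominator) when an out-of-vocabulary word shows up.
That is the formula in the figure above.
Note: this is Laplace (additive) smoothing.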
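
The smoothed frequency estimate can be written out directly. The sketch below assumes a single discrete feature x with n possible values and uses made-up observations; `smoothed_likelihoods` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter


def smoothed_likelihoods(values, n_values, alpha=1.0):
    """Estimate P(x = v | y) for v in 0..n_values-1 from the x values seen in
    one class, adding alpha to every count (Laplace smoothing when alpha=1)."""
    counts = Counter(values)
    denom = len(values) + alpha * n_values
    return [(counts[v] + alpha) / denom for v in range(n_values)]


# Hypothetical observations of x among the samples with y = 0;
# the value 3 never occurs in this class.
x_given_y0 = [0, 0, 1, 2, 2, 2]

unsmoothed = smoothed_likelihoods(x_given_y0, n_values=4, alpha=0.0)
smoothed = smoothed_likelihoods(x_given_y0, n_values=4, alpha=1.0)
print(unsmoothed)
print(smoothed)
```

Without smoothing, the unseen value gets probability 0, which would zero out the whole product in the Naive Bayes formula. With α = 1 every value keeps a small positive probability and the estimates still sum to 1.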

2019-02-12 15_22_17-机器学习第七期升级版.png

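Classifying the iris data with Gaussian Naive Bayes
The code below makes a few corrections to 邹博's original code:

feature_names now includes '类别' (the class column), which fixes the column misalignment.
features was changed to petal length and petal width, because petal length plays the decisive role in classifying the iris data.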

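Q&A
Q: What is "naive" about Naive Bayes?
A: It is naive because it assumes the features are conditionally independent given the class. Frankly, that is not how the real world works. For example, when inferring a person's gender from height, weight and waist circumference with Naive Bayes, the formula is: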

2019-02-12 15_25_43-机器学习第七期升级版.png

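From it we get, for y = male, the probability density of height, of weight and of waist circumference. By the definition of Naive Bayes, height, weight and waist circumference should be independent, but in reality they never are. A heavier person usually has a larger waist, and a taller person is usually not very light. For instance, someone 1.9 m tall who weighs 80 jin (40 kg) is extremely unlikely.
Still, the use of Naive Bayes can be justified by the "divergence" of the assumption: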

2019-02-12 15_29_52-机器学习第七期升级版.png

In addition, there is the "connotation" of the assumption:

2019-02-12 15_31_17-机器学习第七期升级版.png

And the simplicity of the assumption:

2019-02-12 15_31_58-机器学习第七期升级版.png

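Another property of Naive Bayes is that all features are treated as equally weighted: the per-feature terms are simply multiplied together: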

2019-02-12 15_25_43-机器学习第七期升级版.png

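Q: How does Naive Bayes differ from "Bayesian"?
A: They are two completely different things. Naive Bayes is a classifier that reaches its conclusion by applying Bayes' rule. "Bayesian" usually refers to using a Bayesian prior: when estimating a parameter, we do not treat it as an unknown fixed value but as a random variable. That is the Bayesian approach.
Q: If a new word appears, will some P(x) be 0?
A: Yes. Adding α to the numerator and α x n to the denominator in the multinomial Naive Bayes formula exists precisely to handle this case.
Q: What if correlations between features are taken into account?
A: Then we are back to a general Bayesian network, with edges between the features. Naive Bayes is the special case of a Bayesian network in which there are no edges between features, only edges between y and each x_i.
Q: How can the beta parameter be trained?
A: It can be done. Treat beta as following a prior distribution of its own, and then it can be trained.
Q: Some independence assumptions affect all classes uniformly, so they do not change the relative size of the likelihoods. Even when that is not the case, isn't it quite likely that the negative and positive effects of the individual independence assumptions cancel out, so that the final result is barely affected?
A: The first sentence is fine. The second is too optimistic: the effects do not necessarily cancel, and they may instead make the result oscillate. For a complex model, you need to verify whether the independence assumption is acceptable.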

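The three most commonly used Naive Bayes classifier classes in scikit-learn are GaussianNB, MultinomialNB and BernoulliNB. GaussianNB models the class-conditional likelihood P(x_i|y) as a Gaussian distribution, MultinomialNB models it as a multinomial distribution, and BernoulliNB models it as a Bernoulli distribution.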

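The three classes suit different scenarios. In general, GaussianNB works well when most features are continuous; MultinomialNB is appropriate when most features are multi-valued discrete counts; and BernoulliNB should be used when the features are binary, or are very sparse multi-valued discrete values.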

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
# GaussianNB: Naive Bayes with a Gaussian likelihood P(x_i|y)
# MultinomialNB: Naive Bayes with a multinomial likelihood P(x_i|y)
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def iris_type(s):
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]


if __name__ == "__main__":
    data_type = 'iris'  # iris

    if data_type == 'car':
        colmun_names = 'buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'acceptability'
        data = pd.read_csv('car.data', header=None, names=colmun_names)
        for col in colmun_names:
            data[col] = pd.Categorical(data[col]).codes
        x = data[list(colmun_names[:-1])]
        y = data[colmun_names[-1]]
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
        model = MultinomialNB(alpha=1)
        model.fit(x_train, y_train)
        y_train_pred = model.predict(x_train)
        print('CAR训练集准确率：', accuracy_score(y_train, y_train_pred))
        y_test_pred = model.predict(x_test)
        print('CAR测试集准确率：', accuracy_score(y_test, y_test_pred))
    else:
        feature_names = '花萼长度', '花萼宽度', '花瓣长度', '花瓣宽度', '类别'
        data = pd.read_csv('..\\9.Regression\\iris.data', header=None, names=feature_names)
        x, y = data[list(feature_names[:-1])], data[feature_names[-1]]
        y = pd.Categorical(values=data['类别']).codes
        # features = ['花萼长度', '花萼宽度']
        # 鸢尾花数据，花瓣长度起到决定性的作用
        features = ['花瓣长度', '花瓣宽度']
        x = x[features]
        x, x_test, y, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

        priors = np.array((1,2,4), dtype=float)
        priors /= priors.sum()
        gnb = Pipeline([
            ('sc', StandardScaler()),
            ('poly', PolynomialFeatures(degree=1)),
            ('clf', GaussianNB(priors=priors))])    # 由于鸢尾花数据是样本均衡的，其实不需要设置先验值
        # gnb = KNeighborsClassifier(n_neighbors=3).fit(x, y.ravel())
        gnb.fit(x, y.ravel())
        y_hat = gnb.predict(x)
        print('IRIS训练集准确度: %.2f%%' % (100 * accuracy_score(y, y_hat)))
        y_test_hat = gnb.predict(x_test)
        print('IRIS测试集准确度：%.2f%%' % (100 * accuracy_score(y_test, y_test_hat)))

        # Plot the decision regions
        N, M = 500, 500     # 横纵各采样多少个值
        x1_min, x2_min = x.min()
        x1_max, x2_max = x.max()
        t1 = np.linspace(x1_min, x1_max, N)
        t2 = np.linspace(x2_min, x2_max, M)
        x1, x2 = np.meshgrid(t1, t2)                    # 生成网格采样点
        x_grid = np.stack((x1.flat, x2.flat), axis=1)   # 测试点

        mpl.rcParams['font.sans-serif'] = ['simHei']
        mpl.rcParams['axes.unicode_minus'] = False
        cm_light = mpl.colors.ListedColormap(['#77E0A0', '#FF8080', '#A0A0FF'])
        cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
        y_grid_hat = gnb.predict(x_grid)                  # 预测值
        y_grid_hat = y_grid_hat.reshape(x1.shape)
        plt.figure(facecolor='w')
        plt.pcolormesh(x1, x2, y_grid_hat, cmap=cm_light)     # 预测值的显示
        plt.scatter(x[features[0]], x[features[1]], c=y, edgecolors='k', s=30, cmap=cm_dark)
        plt.scatter(x_test[features[0]], x_test[features[1]], c=y_test, marker='^', edgecolors='k', s=40, cmap=cm_dark)

        plt.xlabel(features[0], fontsize=13)
        plt.ylabel(features[1], fontsize=13)
        plt.xlim(x1_min, x1_max)
        plt.ylim(x2_min, x2_max)
        plt.title('GaussianNB对鸢尾花数据的分类结果', fontsize=18)
        plt.grid(True, ls=':', color='#202020')
        plt.show()

The results are as follows:
IRIS训练集准确度: 96.19%
IRIS测试集准确度：97.78%

2019-02-12 11_27_42-Figure 1.png

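Q&A
Q: When sepal length and sepal width are used, does Gaussian Naive Bayes perform poorly because the iris features are linearly correlated?
A: No. Sepal length and sepal width are simply poor features for this task, so a classifier that uses only them cannot do well.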

LDA Implementations

2019-02-12 15_44_43-机器学习第七期升级版.png

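VBEM: variational expectation maximization.
Gensim: its LDA implementation improves on the algorithm in David Blei's LDA-C and uses online variational inference.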

2019-02-12 18_20_03-机器学习第七期升级版.png

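LSI, LFM and ICA can all essentially be viewed as matrix factorizations.
Once LDA has been fitted to the corpus, the similarity between any two documents can be computed.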

2019-02-12 18_22_28-机器学习第七期升级版.png

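With cosine similarity, the angle ranges from 0° to 180°, so the value ranges from 1 down to -1. A negative value means the two vectors are not merely dissimilar but opposite, pointing in contrary directions.
For example, take the two words "Terminator" and "wuxia": one belongs to science fiction, the other to a traditional genre, and their conventions and ideas are completely different. Their similarity might well be negative.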
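
A small sketch of cosine similarity makes the sign behavior concrete. The vectors are invented for illustration: the first pair mimics LDA-style topic distributions (non-negative, summing to 1), the second pair mimics coordinates from a matrix factorization such as LSI, which may contain negative entries.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical LDA-style topic distributions: every entry is >= 0.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]

# Hypothetical LSI-style coordinates: entries may be negative.
lsi_a = [0.9, -0.4]
lsi_b = [-0.8, 0.5]

sim_lda = cosine(doc_a, doc_b)
sim_lsi = cosine(lsi_a, lsi_b)
print(sim_lda, sim_lsi)
```

Because all entries of a probability vector are non-negative, the dot product cannot be negative, so cosine similarity between two LDA topic distributions stays in [0, 1]. Negative similarities only appear with representations that allow negative coordinates.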

2019-02-12 18_25_52-机器学习第七期升级版.png

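With either LDA or LSA, we can compute the probability that any document belongs to a given topic,

as well as the top n most important words of a given topic.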

2019-02-12 18_28_10-机器学习第七期升级版.png

News corpus collected with a web crawler

2019-02-12 18_29_27-机器学习第七期升级版.png

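Use a TF-IDF model to obtain the weighted term vector of each document and feed it to LDA. This yields the topic distribution of every document and lets us see where each topic is concentrated.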

2019-02-12 18_30_41-机器学习第七期升级版.png

2019-02-12 18_31_37-机器学习第七期升级版.png

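Visualizing the topics and the topic distributions.

As the figure shows, we can see the most prominent topic of each document,
and also inspect the word distribution of each topic.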

2019-02-12 18_32_30-机器学习第七期升级版.png

Example 1:

2019-02-12 18_36_09-机器学习第七期升级版.png

We can see which topics occur most often.

2019-02-12 18_37_27-机器学习第七期升级版.png

Example 2:

2019-02-12 18_40_12-机器学习第七期升级版.png

2019-02-12 18_41_57-机器学习第七期升级版.png

2019-02-12 18_50_58-机器学习第七期升级版.png

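Code for topic analysis of QQ chat logs
The chat logs themselves are not suitable for posting here; you can collect your own.
The format is as follows: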

2017-03-07 14:26:44 系统消息(10000)
Joney__加入本群

2017-03-07 20:04:30 美丽草原我家(61xxxx17)
大家好

2017-03-07 20:06:58 2029-皮皮兔(4xxxx066)
好

2017-03-07 20:07:19 2029-皮皮兔(4xxxx066)
也好

2017-03-07 20:08:00 why(7xxxx46252)
都好

2017-03-07 20:10:41 2029-皮皮兔(4xxxx066)
才是真的好

Save it as QQChat.txt in the same directory as the code file.

# !/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
from gensim import corpora, models, similarities
from pprint import pprint
import time
import matplotlib as mpl
import matplotlib.pyplot as plt
import re
import pandas as pd
import jieba
import jieba.posseg


# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


def load_stopword():
    f_stop = open('stopword.txt')
    sw = [line.strip() for line in f_stop]
    f_stop.close()
    return sw


def clean_info(info):
    replace_str = (('\n', ''), ('\r', ''), (',', '，'), ('[表情]', ''))
    for rs in replace_str:
        info = info.replace(rs[0], rs[1])

    at_pattern = re.compile(r'(@.* )')
    at = re.findall(pattern=at_pattern, string=info)
    for a in at:
        info = info.replace(a, '')
    idx = info.find('@')
    if idx != -1:
        info = info[:idx]
    return info


def regularize_data(file_name):
    time_pattern = re.compile(r'\d{4}-\d{2}-\d{2} \d{1,2}:\d{1,2}:\d{1,2}')
    qq_pattern1 = re.compile(r'([1-9]\d{4,15})')    # QQ号最小是10000
    qq_pattern2 = re.compile(r'(\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*)')
    f = open(file_name, encoding='utf-8')
    f_output = open('QQ_chat.csv', mode='w', encoding='utf-8')
    f_output.write('QQ,Time,Info\n')
    qq = chat_time = info = ''
    for line in f:
        line = line.strip()
        if line:
            t = re.findall(pattern=time_pattern, string=line)
            qq1 = re.findall(pattern=qq_pattern1, string=line)
            qq2 = re.findall(pattern=qq_pattern2, string=line)
            if (len(t) >= 1) and ((len(qq1) >= 1) or (len(qq2) >= 1)):
                if info:
                    info = clean_info(info)
                    if info:
                        info = '%s,%s,%s\n' % (qq, chat_time, info)
                        f_output.write(info)
                        info = ''
                if len(qq1) >= 1:
                    qq = qq1[0]
                else:
                    qq = qq2[0][0]
                chat_time = t[0]
            else:
                info += line
    f.close()
    f_output.close()


def load_stopwords():
    stopwords = set()
    f = open('./stopword.txt', encoding='GBK')
    for w in f:
        stopwords.add(w.strip())
    f.close()
    return stopwords


def segment():
    stopwords = load_stopwords()
    data = pd.read_csv('QQ_chat.csv', header=0, encoding='utf-8')
    for i, info in enumerate(data['Info']):
        info_words = []
        for word, pos in jieba.posseg.cut(info):
            if pos in ['n', 'nr', 'ns', 'nt', 'nz', 's', 't', 'v', 'vd', 'vn', 'z', 'a', 'ad', 'an', 'f', 'i', 'j', 'Ng']:
                if word not in stopwords:
                    info_words.append(word)
        if info_words:
            data.iloc[i, 2] = ' '.join(info_words)
        else:
            data.iloc[i, 2] = np.nan
    data.dropna(axis=0, how='any', inplace=True)
    data.to_csv('QQ_chat_segment.csv', sep=',', header=True, index=False, encoding='utf-8')


def combine():
    data = pd.read_csv('QQ_chat_segment.csv', header=0, encoding='utf-8')
    data['QQ'] = pd.Categorical(data['QQ']).codes
    f_output = open('QQ_chat_result.csv', mode='w', encoding='utf-8')
    f_output.write('QQ,Info\n')
    for qq in data['QQ'].unique():
        info = ' '.join(data[data['QQ'] == qq]['Info'])
        line = '%s,%s\n' % (qq, info)   # renamed from `str` to avoid shadowing the built-in
        f_output.write(line)
    f_output.close()


def export_perplexity1(corpus_tfidf, dictionary, corpus):
    lp1 = []
    lp2 = []
    topic_nums = np.arange(2, 51)
    for t in topic_nums:
        model = models.LdaModel(corpus_tfidf, num_topics=t, id2word=dictionary,
                                alpha=0.001, eta=0.02, minimum_probability=0,
                                update_every=1, chunksize=1000, passes=20)
        lp = model.log_perplexity(corpus)
        print('t = ', t, end=' ')
        print('lda.log_perplexity(corpus) = ', lp, end=' ')
        lp1.append(lp)

        lp = model.log_perplexity(corpus_tfidf)
        print('\t lda.log_perplexity(corpus_tfidf) = ', lp)
        lp2.append(lp)
    print(lp1)
    print(lp2)
    column_names = 'Topic', 'Perplexity_Corpus', 'Perplexity_TFIDF'
    perplexity_topic = pd.DataFrame(data=list(zip(topic_nums, lp1, lp2)), columns=column_names)
    perplexity_topic.to_csv('perplexity.csv', header=True, index=False)


def export_perplexity2(corpus_tfidf, dictionary, corpus):
    lp1 = []
    lp2 = []
    t = 20
    passes = np.arange(1, 20)
    for p in passes:
        model = models.LdaModel(corpus_tfidf, num_topics=t, id2word=dictionary,
                                alpha=0.001, eta=0.02, minimum_probability=0,
                                update_every=1, chunksize=100, passes=p)
        lp = model.log_perplexity(corpus)
        print('t = ', t, end=' ')
        print('lda.log_perplexity(corpus) = ', lp, end=' ')
        lp1.append(lp)

        lp = model.log_perplexity(corpus_tfidf)
        print('\t lda.log_perplexity(corpus_tfidf) = ', lp)
        lp2.append(lp)
    print(lp1)
    print(lp2)
    column_names = 'Passes', 'Perplexity_Corpus', 'Perplexity_TFIDF'
    perplexity_topic = pd.DataFrame(data=list(zip(passes, lp1, lp2)), columns=column_names)
    perplexity_topic.to_csv('perplexity2.csv', header=True, index=False)


def lda(export_perplexity=False):
    np.set_printoptions(linewidth=300)
    data = pd.read_csv('QQ_chat_result.csv', header=0, encoding='utf-8')
    texts = []
    for info in data['Info']:
        texts.append(info.split(' '))
    M = len(texts)
    print('文档数目：%d个' % M)
    # pprint(texts)

    print('正在建立词典 --')
    dictionary = corpora.Dictionary(texts)
    V = len(dictionary)
    print('正在计算文本向量 --')
    corpus = [dictionary.doc2bow(text) for text in texts]
    print('正在计算文档TF-IDF --')
    t_start = time.time()
    corpus_tfidf = models.TfidfModel(corpus)[corpus]
    print('建立文档TF-IDF完成，用时%.3f秒' % (time.time() - t_start))
    print('LDA模型拟合推断 --')
    num_topics = 20
    t_start = time.time()
    lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                          alpha=0.001, eta=0.02, minimum_probability=0,
                          update_every=1, chunksize=1000, passes=20)
    print('LDA模型完成，训练时间为\t%.3f秒' % (time.time() - t_start))
    if export_perplexity:
        export_perplexity1(corpus_tfidf, dictionary, corpus)
        # export_perplexity2(corpus_tfidf, dictionary, corpus)
    # # 所有文档的主题
    # doc_topic = [a for a in lda[corpus_tfidf]]
    # print 'Document-Topic:\n'
    # pprint(doc_topic)

    num_show_term = 7  # 每个主题显示几个词
    print('每个主题的词分布：')
    for topic_id in range(num_topics):
        print('主题#%d：\t' % topic_id, end=' ')
        term_distribute_all = lda.get_topic_terms(topicid=topic_id)
        term_distribute = term_distribute_all[:num_show_term]
        term_distribute = np.array(term_distribute)
        term_id = term_distribute[:, 0].astype(int)   # np.int was removed in NumPy 1.24
        for t in term_id:
            print(dictionary[t], end=' ')   # dictionary[t] also fills the lazily built id2token map
        print('\n概率：\t', term_distribute[:, 1])

    # 随机打印某10个文档的主题
    np.set_printoptions(linewidth=200, suppress=True)
    num_show_topic = 10  # 每个文档显示前几个主题
    print('10个用户的主题分布：')
    doc_topics = lda.get_document_topics(corpus_tfidf)  # 所有文档的主题分布
    idx = np.arange(M)
    np.random.shuffle(idx)
    idx = idx[:10]
    for i in idx:
        topic = np.array(doc_topics[i])
        topic_distribute = np.array(topic[:, 1])
        # print topic_distribute
        topic_idx = topic_distribute.argsort()[:-num_show_topic - 1:-1]
        print(('第%d个用户的前%d个主题：' % (i, num_show_topic)), topic_idx)
        print(topic_distribute[topic_idx])
    # Plot the topic distributions of these 10 documents
    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(figsize=(8, 7), facecolor='w')
    for i, k in enumerate(idx):
        ax = plt.subplot(5, 2, i + 1)
        topic = np.array(doc_topics[k])   # k is the document index; i is only the subplot position
        topic_distribute = np.array(topic[:, 1])
        ax.stem(topic_distribute, linefmt='g-', markerfmt='ro')
        ax.set_xlim(-1, num_topics + 1)
        ax.set_ylim(0, 1)
        ax.set_ylabel("概率")
        ax.set_title("用户 {}".format(k))
        plt.grid(True, axis='both', ls=':', color='#606060')   # newer Matplotlib no longer accepts b=
    plt.xlabel("主题", fontsize=13)
    plt.suptitle('用户的主题分布', fontsize=15)
    plt.tight_layout(pad=1, rect=(0, 0, 1, 0.95))   # pad must be passed by keyword in newer Matplotlib
    plt.show()

    # 计算各个主题的强度
    print('\n各个主题的强度:\n')
    topic_all = np.zeros(num_topics)
    doc_topics = lda.get_document_topics(corpus_tfidf)  # 所有文档的主题分布
    for i in np.arange(M):  # 遍历所有文档
        topic = np.array(doc_topics[i])
        topic_distribute = np.array(topic[:, 1])
        topic_all += topic_distribute
    topic_all /= M  # 平均
    idx = topic_all.argsort()
    topic_sort = topic_all[idx]
    print(topic_sort)
    plt.figure(facecolor='w')
    plt.stem(topic_sort, linefmt='g-', markerfmt='ro')
    plt.xticks(np.arange(idx.size), idx)
    plt.xlabel("主题", fontsize=13)
    plt.ylabel("主题出现概率", fontsize=13)
    plt.title('主题强度', fontsize=15)
    plt.grid(True, axis='both', ls=':', color='#606060')
    plt.show()


def show_perplexity():
    data = pd.read_csv('perplexity2.csv', header=0)   # same case as the file written by export_perplexity2
    print(data)
    columns = list(data.columns)
    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(facecolor='w')
    plt.plot(data[columns[0]], data[columns[1]], 'ro-', lw=2, ms=6, label='Log Perplexity(Corpus)')
    # plt.plot(data[columns[0]], data[columns[2]], 'go--', lw=2, ms=6, label='Log Perplexity(TFIDF)')
    plt.legend(loc='lower left')
    plt.xlabel(columns[0], fontsize=16)
    plt.ylabel(columns[1], fontsize=16)
    plt.title('Perplexity', fontsize=18)
    plt.grid(True, axis='both', ls=':', color='#606060')
    plt.show()


if __name__ == '__main__':
    print('regularize_data')
    regularize_data('./QQChat.txt')
    print('segment')
    segment()
    print('combine')
    combine()
    print('lda')
    lda(export_perplexity=False)
    # show_perplexity()

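Q&A
Q: Isn't LDA just a matter of plugging into formulas? Why are there so many implementations?
A: Yes. Take decision trees: there are tens of thousands of implementations, because many programmers write their own. LDA is somewhat more troublesome to write, so calling Gensim's LDA implementation is now the common approach.
Q: Can LDA let a machine write certain articles by itself, for example a summary?
A: No. TextRank or an RNN has some chance of doing that.
Q: Are cosine similarities computed from LDA always positive?
A: Yes. All topic distributions and word distributions produced by LDA are non-negative (in practice greater than 0), so the resulting cosine values are positive. LSI/LSA, however, is based on matrix factorization and may yield negative entries in the topic and word vectors, so a negative similarity is possible.
Q: In speech recognition, one sound can correspond to several characters. How does the machine decide which character to output?
A: Speech recognition does not simply output characters one by one. The audio is split into frames, for example 20 ms each, and a speech model turns each frame into a vector. These vectors are fed into a recurrent network such as an LSTM, which produces an output for each frame. With CTC as the loss function, the RNN weights can then be learned and updated. So the text is pieced together.
Q: Is a topic a single word?
A: No. We can only give the keywords of a topic, in other words its top n words.
Q: From the entropy point of view, is it roughly that a sharply peaked distribution has low entropy and carries much information, while a uniform distribution has high entropy and carries little?
A: That statement is correct. Exponentiating the entropy gives the perplexity (困惑度), which can be used as a measure of how good the final topic model is.
Q: Is a topic a combination of one or several words?
A: Yes, it is a combination.
Q: In that case, how do we know that 乔峰 (Qiao Feng) corresponds to wuxia rather than science fiction?
A: We only know that 乔峰 belongs to, say, topic 5, and that words related to wuxia appear frequently in topic 5. So we conclude that topic 5 is about wuxia.
Q: How does LDA deal with continuous variables? For example, I have many consecutive observations of a variable and want to use LDA to find some characteristic of it.
A: Either discretize the variable before using LDA, or model it with another distribution such as a Gaussian, or use an RNN directly.
Q: Any views on testing algorithms? We need to test recommendation, content-ranking and other algorithms for our product.
A: A/B testing.
Q: So the words that summarize a topic are always words from the documents?
A: Yes.
Q: So the real central idea cannot be extracted automatically from the words under a topic?
A: Correct. A human has to name the central idea.
Q: If a random forest still fits poorly after increasing the tree depth and the number of trees, does that mean the model is not a good fit for the problem?
A: Possibly. A more complex model is needed.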
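
The entropy question in the Q&A above can be checked numerically. The sketch below computes the entropy of a single discrete distribution and its exponential (the perplexity of that distribution). The two distributions are invented, and this is not how Gensim's `log_perplexity` evaluates a corpus; it only illustrates the relationship between a peaked distribution, entropy and perplexity.

```python
import math


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0)


def perplexity(p):
    """Exponential of the entropy: the 'effective number of choices'."""
    return math.exp(entropy(p))


peaked = [0.97, 0.01, 0.01, 0.01]   # hypothetical sharply peaked topic distribution
uniform = [0.25, 0.25, 0.25, 0.25]  # hypothetical uniform topic distribution

print(entropy(peaked), perplexity(peaked))
print(entropy(uniform), perplexity(uniform))
```

For the uniform distribution over four topics the perplexity is 4, the number of equally likely choices. For the peaked one it is close to 1, meaning the document is essentially about a single topic.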

TextRank

2019-02-13 17_49_43-机器学习第七期升级版.png

TextRank borrows from PageRank, the algorithm originally proposed by Google. The idea: assign a weight to every sentence and output the weights.
Code example:

# coding:utf-8

from textrank4zh import TextRank4Keyword, TextRank4Sentence
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl


if __name__ == '__main__':
    f = open('./novel.txt', mode='r', encoding='utf-8')
    text = f.read()
    f.close()

    tr4w = TextRank4Keyword()
    tr4w.analyze(text=text, lower=True, window=5)
    print('关键词：')
    for item in tr4w.get_keywords(10, word_min_len=1):
        print(item['word'], item['weight'])

    tr4s = TextRank4Sentence()
    tr4s.analyze(text=text, lower=True, source = 'no_stop_words')
    data = pd.DataFrame(data=tr4s.key_sentences)
    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(facecolor='w')
    plt.plot(data['weight'], 'ro-', lw=2, ms=5, alpha=0.7, mec='#404040')
    plt.grid(True, ls=':', color='#606060')
    plt.xlabel('句子', fontsize=12)
    plt.ylabel('重要度', fontsize=12)
    plt.title('句子的重要度曲线', fontsize=15)
    plt.show()

    key_sentences = tr4s.get_key_sentences(num=10, sentence_min_len=2)
    for sentence in key_sentences:
        print(sentence['weight'], sentence['sentence'])
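
To show what "borrowing from PageRank" means, here is a minimal sketch of the iteration on a tiny sentence graph. It is not the textrank4zh implementation: the similarity matrix is made up, `text_rank` is a hypothetical helper, and the constant term uses the normalized form (1 - d) / n so that the scores sum to 1.

```python
def text_rank(similarity, damping=0.85, iterations=50):
    """Score the nodes of a weighted undirected graph with PageRank-style updates.

    similarity[i][j] is the edge weight between sentences i and j
    (0 on the diagonal). Each node passes its score to its neighbors
    in proportion to the edge weights.
    """
    n = len(similarity)
    out_weight = [sum(row) for row in similarity]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(similarity[j][i] / out_weight[j] * scores[j]
                       for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


# Hypothetical pairwise sentence similarities; sentence 3 is barely related to the rest.
sim = [
    [0.0, 0.6, 0.5, 0.0],
    [0.6, 0.0, 0.4, 0.1],
    [0.5, 0.4, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0],
]

scores = text_rank(sim)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores)
print(ranking)
```

Sentences that are similar to many other sentences collect more score, while the weakly connected sentence ends up last. The weight curve plotted by the code above shows the same kind of ranking for a real text.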

Word2Vec

2019-02-14 10_15_18-机器学习第七期升级版.png

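Commonly used word-embedding packages are Word2Vec and GloVe; both map words to vectors. (In the second half of 2018, techniques such as ELMo and BERT also appeared.)
Code example:
This code covers most of the common uses of Gensim's Word2Vec.
Similarity: the normalized dot product of two vectors, i.e. the cosine of the angle between them.
Clustering: for example finding the outlier word, the one least similar to the rest.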
# -*- coding: utf-8 -*-

from time import time
from gensim.models import Word2Vec
import os


class LoadCorpora(object):
    def __init__(self, s):
        self.path = s

    def __iter__(self):
        # Use a with-block so the corpus file is closed when iteration ends
        with open(self.path, 'r', encoding='utf-8') as f:
            for news in f:
                yield news.split(' ')


def print_list(a):
    for i, s in enumerate(a):
        if i != 0:
            print('+', end=' ')
        print(s, end=' ')


if __name__ == '__main__':
    if not os.path.exists('news.model'):
        sentences = LoadCorpora('news.dat')
        t_start = time()
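        # 200-dimensional vectors; words seen fewer than 5 times are dropped.
        # Note: gensim >= 4.0 renames the `size` argument to `vector_size`.
        model = Word2Vec(sentences, size=200, min_count=5, workers=8)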
        model.save('news.model')
        print('OK:', time() - t_start)

    model = Word2Vec.load('news.model')
    print(type(model))
    # Note: gensim >= 4.0 replaces model.wv.vocab with model.wv.key_to_index
    print('词典中词的个数：', len(model.wv.vocab))
    for i, word in enumerate(model.wv.vocab):
        print(word, end=' ')
        if i % 25 == 24:
            print()
    print()

    intrested_words = ('中国', '手机', '学习', '人民', '名义')
    print('特征向量：')
    for word in intrested_words:
        print(word, len(model.wv[word]), model.wv[word])   # vectors live on model.wv
    for word in intrested_words:
        result = model.wv.most_similar(word)
        print('与', word, '最相近的词：')
        for w, s in result:
            print('\t', w, s)

    words = ('中国', '祖国', '毛泽东', '人民')
    for i in range(len(words)):
        w1 = words[i]
        for j in range(i+1, len(words)):
            w2 = words[j]
            print('%s 和 %s 的相似度为：%.6f' % (w1, w2, model.wv.similarity(w1, w2)))

    print('========================')
    opposites = ((['中国', '城市'], ['学生']),
                 (['男', '工作'], ['女']),
                 (['俄罗斯', '美国', '英国'], ['日本']))
    for positive, negative in opposites:
        result = model.wv.most_similar(positive=positive, negative=negative)
        print_list(positive)
        print('-', end=' ')
        print_list(negative)
        print('：')
        for word, similar in result:
            print('\t', word, similar)

    print('========================')
    words_list = ('苹果 三星 美的 海尔', '中国 日本 韩国 美国 北京',
                  '医院 手术 护士 医生 感染 福利', '爸爸 妈妈 舅舅 爷爷 叔叔 阿姨 老婆')
    for words in words_list:
        print(words, '离群词：', model.wv.doesnt_match(words.split(' ')))

Output:

与中国最相近的词：
美国 0.8539955615997314
俄罗斯 0.8372732400894165
日本 0.8261638879776001
南海 0.8248181343078613
印度 0.8242015838623047
国际 0.8181694746017456
朝鲜 0.8126145601272583
菲律宾 0.8071180582046509
韩国 0.7863374948501587
领土 0.7841207981109619
与手机最相近的词：
房间 0.929143488407135
车 0.9280033111572266
老人 0.927441418170929
身体 0.9267481565475464
老师 0.9242931604385376
家长 0.9234412908554077
孩子 0.9232087135314941
然后 0.9148879051208496
对方 0.9147698283195496
寻找 0.9145939946174622
与学习最相近的词：
侧 0.9457958936691284
树立 0.9449371099472046
结构性 0.9432505965232849
思想 0.9408413767814636
人才 0.9383446574211121
创新 0.9368791580200195
把握 0.9362080097198486
注重 0.9309667944908142
供给 0.9298370480537415
事业 0.9284298419952393
与人民最相近的词：
前沿 0.8582643270492554
军队 0.8575586080551147
投资 0.8508684635162354
固定资产 0.8431071043014526
全心全意 0.840628445148468
前列 0.8372524976730347
先进 0.8303670287132263
产业 0.8247973918914795
引进 0.8219519853591919
经济带 0.8204346895217896
与名义最相近的词：
为首 0.9553987979888916
慈善机构 0.9533225893974304
司法局 0.9508923888206482
区段 0.9485028982162476
论处 0.9476956129074097
新泽西州 0.9468851089477539
违禁 0.9465380907058716
奖励 0.9462060928344727
全力以赴 0.9461012482643127
借款 0.9449468851089478
中国和祖国的相似度为：0.519305
中国和毛泽东的相似度为：0.296508
中国和人民的相似度为：0.339817
祖国和毛泽东的相似度为：0.889088
祖国和人民的相似度为：0.758998
毛泽东和人民的相似度为：0.651832
========================
中国 + 城市 - 学生：
国际 0.8523926138877869
经济 0.850591778755188
战略 0.8446991443634033
发展 0.8198585510253906
和平 0.8031147718429565
税制 0.7999919652938843
长江三角洲 0.7924644351005554
我国 0.792377233505249
全球 0.790847897529602
推动 0.7879500389099121
男 + 工作 - 女：
两学 0.7839909791946411
做好 0.7537842988967896
督察 0.7321479916572571
调试 0.7309386134147644
党校 0.7155871391296387
师德师 0.7144147157669067
谈话 0.7095764875411987
进行批评 0.7048543691635132
各级 0.7031557559967041
以此为戒 0.702857494354248
俄罗斯 + 美国 + 英国 - 日本：
法国 0.9563250541687012
印度 0.9538899660110474
俄 0.9474135637283325
安倍 0.9358516931533813
韩国 0.9355806708335876
菲律宾 0.9353019595146179
德国 0.9314655065536499
日 0.9268109202384949
中谷 0.924175500869751
朝鲜 0.9238186478614807
========================
苹果三星美的海尔离群词：海尔
中国日本韩国美国北京离群词：北京
医院手术护士医生感染福利离群词：福利
爸爸妈妈舅舅爷爷叔叔阿姨老婆离群词：妈妈

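Q&A
Q: Why does the LDA topic model use the Dirichlet and multinomial distributions? They do not seem to be very common distributions. Apart from conjugacy, were they tried and found to be the best fit for documents?
A: Suppose a document has 15 topics. Its topic distribution is then 15 numbers, which is exactly a multinomial distribution, so the multinomial is the most natural choice. With the multinomial as the likelihood, its conjugate distribution is the most convenient prior, and that conjugate is the Dirichlet distribution.
In fact, both the Dirichlet and the multinomial are commonly used distributions.
Q: Is the vector of each word fixed, or does it depend on the dictionary?
A: The word vectors depend on the corpus. We can train vectors on our own corpus, or train general-purpose vectors on a general corpus.
Word vectors are simple and fast, but they do not solve the problem that topic models address.
Using word vectors assumes that the same word never appears under different topics, i.e. word vectors cannot handle polysemy.
Q: Can similarity be computed between document vectors of different lengths?
A: No. They must be padded or interpolated first; the dimensions have to match.
Q: The topic words change a lot from one training run to the next, which feels unreliable?
A: Yes. Either training has not converged, or the random number generator is seeded differently on each run.
Q: Word2Vec predicts the context words, so how does that turn into vectors?
A: The trained network is not the goal. The network takes a word as input and predicts, say, the 4 surrounding words. Given an input vector, the predicted outputs are compared with the words actually observed; where they differ, the word vectors themselves are adjusted. What we want in the end is not the network but the word vectors.
Q: After obtaining the word2vec vector of every word, how do we get a vector for the whole document? Simply accumulating them seems to give inconsistent lengths.
A: You can sum them. Say the first word is 花生 (peanut), which has a vector, and the second is 山药 (yam), which also has a vector; the two can be added. Another word, 许地山 (Xu Dishan), also has a vector, and it can be added to 花生 as well. That gives two different results.
Q: How is an article split into words?
A: There are many methods. Algorithmically, the hidden Markov model is the most widely used. For Chinese word segmentation, ready-made Python packages include jieba and the more recent pkuseg.
Q: After summing, the length may be off. Should the result be normalized?
A: The vectors can be summed directly, but dividing the result by N is also fine.
Q: Is document similarity used for document classification?
A: It can be. It can also be used for other things, since the documents are now numeric.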
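
The answers above about building a document vector from word vectors can be sketched as follows. The 3-dimensional embeddings are invented for illustration (real Word2Vec vectors would come from a trained model), and `document_vector` is a hypothetical helper.

```python
def document_vector(words, word_vectors, average=True):
    """Sum (or average) the vectors of the words that have an embedding."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        raise ValueError("no word in the document has a vector")
    dim = len(known[0])
    total = [sum(vec[k] for vec in known) for k in range(dim)]
    if average:
        # Dividing by N keeps long and short documents on the same scale.
        return [t / len(known) for t in total]
    return total


# Hypothetical 3-dimensional word embeddings.
word_vectors = {
    "花生": [0.2, 0.1, 0.7],
    "山药": [0.4, 0.3, 0.5],
    "许地山": [0.9, 0.8, 0.1],
}

short_doc = document_vector(["花生", "山药"], word_vectors)
long_doc = document_vector(["花生", "山药", "许地山", "花生"], word_vectors)
print(short_doc)
print(long_doc)
```

Whatever the number of words, the result has the same dimension as a single word vector, so documents of different lengths can be compared directly, for example with cosine similarity.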

Markov Model Simulation Experiment

2019-02-14 11_18_53-机器学习第七期升级版.png

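Predicting the subsequent position:
We assume that the probability of any point depends on the probabilities of its neighboring points. A similarity (transition) matrix is then generated at random from the similarity of the edges between each neighbor and the current point,
for example favoring the west-to-east direction, or following a cosine-shaped pattern.
The code is as follows: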

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt
import os
from matplotlib import animation
from PIL import Image


def update(f):
    global loc
    if f == 0:
        loc = loc_prime
    next_loc = np.zeros((m, n), dtype=float)   # np.float was removed in NumPy 1.24
    for i in np.arange(m):
        for j in np.arange(n):
            next_loc[i, j] = calc_next_loc(np.array([i, j]), loc, directions)
    loc = next_loc / np.max(next_loc)
    im.set_array(loc)

    # Save
    if save_image:
        if f % 3 == 0:
            image_data = plt.cm.coolwarm(loc) * 255
            image_data, _ = np.split(image_data, (-1, ), axis=2)
            image_data = image_data.astype(np.uint8).clip(0, 255)
            output = '.\\Pic2\\'
            if not os.path.exists(output):
                os.mkdir(output)
            a = Image.fromarray(image_data, mode='RGB')
            a.save('%s%d.png' % (output, f))
    return [im]


def calc_next_loc(now, loc, directions):
    near_index = np.array([(-1, -1), (-1, 0), (-1, 1),
                  (0, -1), (0, 1),
                  (1, -1), (1, 0), (1, 1)])
    directions_index = np.array([7, 6, 5, 0, 4, 1, 2, 3])
    nn = now + near_index
    ii, jj = nn[:, 0], nn[:, 1]
    ii[ii >= m] = 0
    jj[jj >= n] = 0
    return np.dot(loc[ii, jj], directions[ii, jj, directions_index])


if __name__ == '__main__':
    np.set_printoptions(suppress=True, linewidth=300, edgeitems=8)
    np.random.seed(0)

    save_image = False
    style = 'Sin'   # Sin/Direct/Random
    m, n = 50, 100
    directions = np.random.rand(m, n, 8)

    if style == 'Direct':
        directions[:,:,1] = 10
    elif style == 'Sin':
        x = np.arange(n)
        y_d = np.cos(6*np.pi*x/n)
        theta = np.empty_like(x, dtype=int)   # np.int was removed in NumPy 1.24
        theta[y_d > 0.5] = 1
        theta[~(y_d > 0.5) & (y_d > -0.5)] = 0
        theta[~(y_d > -0.5)] = 7
        directions[:, x.astype(int), theta] = 10
    directions[:, :] /= np.sum(directions[:, :])
    print(directions)

    loc = np.zeros((m, n), dtype=float)
    loc[m//2, n//2] = 1
    loc_prime = loc.copy()   # independent copy of the initial state, restored at frame 0
    fig = plt.figure(figsize=(8, 6), facecolor='w')
    im = plt.imshow(loc/np.max(loc), cmap='coolwarm')
    anim = animation.FuncAnimation(fig, update, frames=300, interval=50, blit=True)
    plt.tight_layout(pad=1.5)
    plt.show()
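
The core of the simulation above is one Markov step: the next distribution is the current distribution multiplied by the transition probabilities. A minimal sketch on a hypothetical 3-state ring, biased toward moving "east" (state i to state i+1), is shown below. Unlike the grid code above, which rescales by the maximum for display, this sketch keeps a proper probability distribution at every step.

```python
def step(distribution, transition):
    """One Markov step: new[j] = sum_i distribution[i] * transition[i][j]."""
    n = len(distribution)
    return [sum(distribution[i] * transition[i][j] for i in range(n))
            for j in range(n)]


# Hypothetical transition matrix; each row sums to 1 and favors i -> i+1.
transition = [
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.7, 0.1, 0.2],
]

dist = [1.0, 0.0, 0.0]   # start with all probability mass in state 0
history = [dist]
for _ in range(3):
    dist = step(dist, transition)
    history.append(dist)

for t, d in enumerate(history):
    print(t, d)
```

After one step most of the mass has moved to state 1, then to state 2, which is the one-dimensional analogue of the probability "blob" drifting across the grid in the animation.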

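Text Classification with Different Classifiers

The classifiers are: multinomial Naive Bayes, Bernoulli Naive Bayes (a Naive Bayes built from several Bernoulli distributions), nearest neighbors, ridge classifier, random forest, and SVM.
The code is as follows: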

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from time import time
from pprint import pprint
import matplotlib.pyplot as plt
import matplotlib as mpl


def test_clf(clf):
    print('分类器：', clf)
    alpha_can = np.logspace(-3, 2, 10)
    model = GridSearchCV(clf, param_grid={'alpha': alpha_can}, cv=5)
    m = alpha_can.size
    if hasattr(clf, 'alpha'):
        model.set_params(param_grid={'alpha': alpha_can})
        m = alpha_can.size
    if hasattr(clf, 'n_neighbors'):
        neighbors_can = np.arange(1, 15)
        model.set_params(param_grid={'n_neighbors': neighbors_can})
        m = neighbors_can.size
    if hasattr(clf, 'C'):
        C_can = np.logspace(1, 3, 3)
        gamma_can = np.logspace(-3, 0, 3)
        model.set_params(param_grid={'C':C_can, 'gamma':gamma_can})
        m = C_can.size * gamma_can.size
    if hasattr(clf, 'max_depth'):
        max_depth_can = np.arange(4, 10)
        model.set_params(param_grid={'max_depth': max_depth_can})
        m = max_depth_can.size
    t_start = time()
    model.fit(x_train, y_train)
    t_end = time()
    t_train = (t_end - t_start) / (5*m)
    print('5折交叉验证的训练时间为：%.3f秒/(5*%d)=%.3f秒' % ((t_end - t_start), m, t_train))
    print('最优超参数为：', model.best_params_)
    t_start = time()
    y_hat = model.predict(x_test)
    t_end = time()
    t_test = t_end - t_start
    print('测试时间：%.3f秒' % t_test)
    acc = metrics.accuracy_score(y_test, y_hat)
    print('测试集准确率：%.2f%%' % (100 * acc))
    name = str(clf).split('(')[0]
    index = name.find('Classifier')
    if index != -1:
        name = name[:index]     # 去掉末尾的Classifier
    if name == 'SVC':
        name = 'SVM'
    return t_train, t_test, 1-acc, name


if __name__ == "__main__":
    print('开始下载/加载数据...')
    t_start = time()
    # remove = ('headers', 'footers', 'quotes')
    remove = ()
    categories = 'alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space'
    # categories = None     # 若分类所有类别，请注意内存是否够用
    corpus_path = './corpora_data'
    data_train = fetch_20newsgroups(data_home=corpus_path, subset='train', categories=categories, shuffle=True, random_state=0, remove=remove)
    data_test = fetch_20newsgroups(data_home=corpus_path, subset='test', categories=categories, shuffle=True, random_state=0, remove=remove)
    t_end = time()
    print('下载/加载数据完成，耗时%.3f秒' % (t_end - t_start))
    print('数据类型：', type(data_train))
    print('训练集包含的文本数目：', len(data_train.data))
    print('测试集包含的文本数目：', len(data_test.data))
    print('训练集和测试集使用的%d个类别的名称：' % len(categories))
    categories = data_train.target_names
    pprint(categories)
    y_train = data_train.target
    y_test = data_test.target
    print(' -- 前10个文本 -- ')
    for i in np.arange(10):
        print('文本%d(属于类别 - %s)：' % (i+1, categories[y_train[i]]))
        print(data_train.data[i])
        print('\n\n')
    vectorizer = TfidfVectorizer(input='content', stop_words='english', max_df=0.5, sublinear_tf=True)
    x_train = vectorizer.fit_transform(data_train.data)  # x_train是稀疏的，scipy.sparse.csr.csr_matrix
    x_test = vectorizer.transform(data_test.data)
    print('训练集样本个数：%d，特征个数：%d' % x_train.shape)
    print('停止词:\n', end=' ')
    pprint(vectorizer.get_stop_words())
    feature_names = np.asarray(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

    print('\n\n===================\n分类器的比较：\n')
    clfs = (MultinomialNB(),                # 0.87(0.017), 0.002, 90.39%
            BernoulliNB(),                  # 1.592(0.032), 0.010, 88.54%
            KNeighborsClassifier(),         # 19.737(0.282), 0.208, 86.03%
            RidgeClassifier(),              # 25.6(0.512), 0.003, 89.73%
            RandomForestClassifier(n_estimators=200),   # 59.319(1.977), 0.248, 77.01%
            SVC()                           # 236.59(5.258), 1.574, 90.10%
            )
    result = []
    for clf in clfs:
        a = test_clf(clf)
        result.append(a)
        print('\n')
    result = np.array(result)
    # Transpose so that all t_train values form one row,
    # all t_test values one row, all error rates one row,
    # and all classifier names one row
    time_train, time_test, err, names = result.T
    time_train = time_train.astype(float)   # np.float was removed in NumPy 1.24
    time_test = time_test.astype(float)
    err = err.astype(float)
    x = np.arange(len(time_train))
    mpl.rcParams['font.sans-serif'] = ['simHei']
    mpl.rcParams['axes.unicode_minus'] = False
    plt.figure(figsize=(8, 6), facecolor='w')
    ax = plt.axes()
    b1 = ax.bar(x, err, width=0.25, color='#77E0A0', edgecolor='k')
    ax_t = ax.twinx()
    b2 = ax_t.bar(x+0.25, time_train, width=0.25, color='#FFA0A0', edgecolor='k')
    b3 = ax_t.bar(x+0.5, time_test, width=0.25, color='#FF8080', edgecolor='k')
    plt.xticks(x+0.5, names)
    plt.legend([b1[0], b2[0], b3[0]], ('错误率', '训练时间', '测试时间'), loc='upper left', shadow=True)
    plt.title('新闻组文本数据不同分类器间的比较', fontsize=18)
    plt.xlabel('分类器名称')
    plt.grid(True)
    plt.tight_layout(pad=2)
    plt.show()

The output is as follows:

开始下载/加载数据...
Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)
下载/加载数据完成，耗时133.020秒
数据类型： <class 'sklearn.utils.Bunch'>
训练集包含的文本数目： 2034
测试集包含的文本数目： 1353
训练集和测试集使用的4个类别的名称：
['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
-- 前10个文本 --
文本1(属于类别 - alt.atheism)：
(The sample texts are omitted here: they are emails and very long.)
训练集样本个数：2034，特征个数：33809
===================

分类器的比较：

分类器： MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
5折交叉验证的训练时间为：0.685秒/(5*10)=0.014秒
最优超参数为： {'alpha': 0.003593813663804626}
测试时间：0.002秒
测试集准确率：89.58%

分类器： BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
5折交叉验证的训练时间为：1.255秒/(5*10)=0.025秒
最优超参数为： {'alpha': 0.001}
测试时间：0.006秒
测试集准确率：88.54%

分类器： KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
5折交叉验证的训练时间为：15.144秒/(5*14)=0.216秒
最优超参数为： {'n_neighbors': 3}
测试时间：0.145秒
测试集准确率：86.03%

分类器： RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
max_iter=None, normalize=False, random_state=None, solver='auto',
tol=0.001)
5折交叉验证的训练时间为：21.732秒/(5*10)=0.435秒
最优超参数为： {'alpha': 0.01291549665014884}
测试时间：0.002秒
测试集准确率：89.36%

分类器： RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',max_depth=None, max_features='auto', max_leaf_nodes=None,min_impurity_decrease=0.0, min_impurity_split=None,min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,oob_score=False, random_state=None, verbose=0,warm_start=False)
5折交叉验证的训练时间为：24.694秒/(5*6)=0.823秒
最优超参数为： {'max_depth': 9}
测试时间：0.150秒
测试集准确率：77.46%

分类器：SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
5折交叉验证的训练时间为：270.153秒/(5*9)=6.003秒
最优超参数为： {'C': 100.0, 'gamma': 0.03162277660168379}
测试时间：1.731秒
测试集准确率：90.10%

textclassification.png

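As the results show, Naive Bayes is hundreds of times faster than the SVM, in both training and prediction, at nearly the same test accuracy (89.58% vs. 90.10%).
So for text tasks where you want a quick result with little effort, Naive Bayes is probably a good fit.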
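
As a quick check of the "hundreds of times" claim, the timings printed in the log above can be compared directly. The sketch below only does arithmetic on those logged numbers; the variable names are mine and are not part of the course code.

```python
# Timings copied from the run shown above (seconds).
nb_fit, svm_fit = 0.014, 6.003      # average time per cross-validation fit
nb_test, svm_test = 0.002, 1.731    # prediction time on the test set

fit_speedup = svm_fit / nb_fit      # how many times slower one SVC fit is
test_speedup = svm_test / nb_test   # how many times slower SVC prediction is

print('fit speedup:  %.0fx' % fit_speedup)
print('test speedup: %.0fx' % test_speedup)
```

The per-fit averages are used rather than the total grid-search times because the two classifiers were searched over grids of different sizes.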

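Q&A
Q: Can LDA be used for plagiarism detection in papers?
A: In principle it can be tried, but it will not give a conclusive result: LDA works at the topic level and is not fine-grained enough for that. It is suitable for paper recommendation, though.
Q: Now that deep learning seq models exist, will what LDA does be completely replaced? Or are there things LDA can do that deep learning cannot?
A: Deep learning is currently the most widely used approach in NLP, and LDA is comparatively lightweight. The two can be combined, though. For example, feed the sentence vectors into one RNN, feed the topic vectors into another RNN followed by average pooling (which reflects the nature of the topics themselves), concatenate the sentence features with the topic features, apply Attention to obtain a flattened layer, and finish with a Softmax classifier.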

2019-02-19 14_35_46-机器学习第七期升级版.png

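So no particular model is mandatory; it depends on the scenario.
Q: What is the difference between MultinomialNB and GaussianNB?
A: They differ in how p(x|y) is modeled: given the class, the distribution of each feature is assumed to be multinomial (MultinomialNB) or Gaussian (GaussianNB).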
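
The difference can be sketched in a few lines of plain Python. This is only an illustration, not how scikit-learn implements the two classifiers: the function names, the word counts, and the standard deviation of 0.06 m are assumptions; the 1.62 m / 1.86 m heights and the smoothing form (add α to the numerator, α × n to the denominator) come from the earlier part of these notes.

```python
import math


def multinomial_likelihood(count_in_class, total_in_class, n_features, alpha=1.0):
    """Smoothed frequency estimate of P(x_i | y) for a discrete feature."""
    return (count_in_class + alpha) / (total_in_class + alpha * n_features)


def gaussian_likelihood(x, mean, std):
    """Gaussian density p(x_i | y) for a continuous feature."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


# Discrete case: a word seen 3 times among 100 word occurrences of class y,
# with a vocabulary of 1000 words (illustrative numbers).
p_word = multinomial_likelihood(3, 100, 1000)
# An unseen word still gets a non-zero probability thanks to smoothing.
p_unseen = multinomial_likelihood(0, 100, 1000)

# Continuous case: a height of 1.86 m when the class mean is 1.62 m
# (the standard deviation of 0.06 m is an assumed value).
p_tall = gaussian_likelihood(1.86, 1.62, 0.06)
p_mean = gaussian_likelihood(1.62, 1.62, 0.06)

print(p_word, p_unseen, p_tall, p_mean)
```

Word counts or TF-IDF weights fit the multinomial form, which is why the text classification code uses MultinomialNB, while continuous measurements such as the iris features fit the Gaussian form.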
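Q: What is the classification in the program based on? Do the transformed vectors have labels?
A: The classification is supervised by the labels: y (the newsgroup category of each post) is the label, and the transformed TF-IDF vectors are the features x that go with it.
Q: What is TextRank?
A: It is derived from Google's PageRank: it computes the similarity between sentences from the words they have in common. (I wrote a Jianshu post about TextRank: "Using TextRank to quickly extract a summary of an article".)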
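
The idea in that answer can be sketched as follows. This is a minimal sketch under stated assumptions, not the code from the post mentioned above: sentences are tokenized by whitespace, similarity is the number of shared words divided by the sum of the log sentence lengths (the normalization I recall from the TextRank paper), and the function names, damping factor 0.85, and iteration count are my own choices.

```python
import math


def similarity(a, b):
    """Word-overlap similarity between two tokenized sentences."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom <= 0:          # both sentences have a single word
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank(sentences, d=0.85, n_iter=50):
    """Score sentences with a PageRank-style iteration on the similarity graph."""
    tokens = [s.lower().split() for s in sentences]
    n = len(tokens)
    # Edge weights: similarity between every pair of distinct sentences.
    w = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(n_iter):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to w[j][i].
            rank = sum(w[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - d) / n + d * rank)
        scores = new_scores
    return scores


sentences = [
    'naive bayes is a fast text classifier',
    'lda is a topic model for text',
    'a fast topic model helps text search',
    'it rained all day',
]
scores = textrank(sentences)
for score, sentence in sorted(zip(scores, sentences), reverse=True):
    print('%.4f  %s' % (score, sentence))
```

The highest-scoring sentences would be taken as the summary; the last sentence shares no words with the others, so it only receives the baseline score.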

Topic extraction code for NetEase News

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
from gensim import corpora, models, similarities
from pprint import pprint
import time

# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


def load_stopword():
    """Read the stop-word list, one word per line."""
    with open('./stopword.txt', encoding='GBK') as f_stop:
        sw = [line.strip() for line in f_stop]
    return sw


if __name__ == '__main__':
    print('初始化停止词列表 --')
    t_start = time.time()
    stop_words = load_stopword()

    print('开始读入语料数据 -- ')
    f = open('./news.dat', encoding='utf-8')    #LDA_test.txt
    texts = [[word for word in line.strip().lower().split() if word not in stop_words] for line in f]
    # texts = [line.strip().split() for line in f]
    print('读入语料数据完成，用时%.3f秒' % (time.time() - t_start))
    f.close()
    M = len(texts)
    print('文本数目：%d个' % M)
    # pprint(texts)

    print('正在建立词典 --')
    dictionary = corpora.Dictionary(texts)
    V = len(dictionary)
    print('词的个数：', V)
    print('正在计算文本向量 --')
    corpus = [dictionary.doc2bow(text) for text in texts]
    print('正在计算文档TF-IDF --')
    t_start = time.time()
    corpus_tfidf = models.TfidfModel(corpus)[corpus]
    print('建立文档TF-IDF完成，用时%.3f秒' % (time.time() - t_start))
    print('LDA模型拟合推断 --')
    num_topics = 10
    t_start = time.time()
    lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                            alpha=0.01, eta=0.01, minimum_probability=0.001,
                            update_every = 1, chunksize = 100, passes=5)
    print('LDA模型完成，训练时间为\t%.3f秒' % (time.time() - t_start))
    # # Topics of all documents
    # doc_topic = [a for a in lda[corpus_tfidf]]
    # print('Document-Topic:\n')
    # pprint(doc_topic)

    # Print the topics of 10 randomly chosen documents
    num_show_topic = 10  # number of top topics to show per document
    print('10个文档的主题分布：')
    doc_topics = lda.get_document_topics(corpus_tfidf)  # topic distributions of all documents
    idx = np.arange(M)
    np.random.shuffle(idx)
    idx = idx[:10]
    for i in idx:
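        topic = np.array(doc_topics[i])     # rows of (topic_id, probability)
        print('topic = \t', topic)
        topic_ids = topic[:, 0].astype(int)
        topic_distribute = topic[:, 1]
        # Topics below minimum_probability are dropped, so row positions are
        # not topic ids; map the sorted positions back through topic_ids.
        order = topic_distribute.argsort()[:-num_show_topic-1:-1]
        print(('第%d个文档的前%d个主题：' % (i, num_show_topic)), topic_ids[order])
        print(topic_distribute[order])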
    num_show_term = 7   # number of words to show per topic
    print('每个主题的词分布：')
    for topic_id in range(num_topics):
        print('主题#%d：\t' % topic_id)
        term_distribute_all = lda.get_topic_terms(topicid=topic_id)
        term_distribute = term_distribute_all[:num_show_term]
        term_distribute = np.array(term_distribute)
        term_id = term_distribute[:, 0].astype(int)    # np.int was removed in NumPy 1.24
        print('词：\t', end=' ')
        for t in term_id:
            # dictionary[t] fills the lazily built id2token map if needed
            print(dictionary[t], end=' ')
        print()
        # print('\n概率：\t', term_distribute[:, 1])

The output is omitted this time; otherwise the post would be too long to publish @_@
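
To make the bag-of-words and TF-IDF steps in the code above concrete, here is a plain-Python sketch on three toy documents. It is not gensim's implementation: the toy corpus and the function name are made up, and the weighting (count × log2(N / document frequency), then scaling to unit length) is what I understand gensim's TfidfModel to do by default, so treat the exact formula as an assumption.

```python
import math
from collections import Counter

# Toy corpus: each document is already split into words (illustrative data).
docs = [['stock', 'market', 'stock'], ['market', 'news'], ['football', 'news']]

# Bag of words: each document becomes {word: count}; word order is discarded.
bows = [Counter(doc) for doc in docs]

# Document frequency: in how many documents each word occurs.
df = Counter(word for bow in bows for word in bow)
n_docs = len(docs)


def tfidf(bow):
    """Weight each count by log2(N / df), then scale the vector to unit length."""
    weights = {w: tf * math.log2(n_docs / df[w]) for w, tf in bow.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0:           # every word occurs in every document
        return {}
    return {w: v / norm for w, v in weights.items() if v > 0}


tfidf_docs = [tfidf(bow) for bow in bows]
print(tfidf_docs[0])
```

In the first document, "stock" ends up with a much larger weight than "market": it occurs twice and appears in no other document, whereas "market" is shared with the second document.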

Reuters corpus exercise

# -*- coding:utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import lda
import lda.datasets
from pprint import pprint


if __name__ == "__main__":
    # document-term matrix
    X = lda.datasets.load_reuters()
    print("type(X): {}".format(type(X)))
    print("shape: {}\n".format(X.shape))
    print(X[:10, :10])

    # the vocab
    vocab = lda.datasets.load_reuters_vocab()
    print("type(vocab): {}".format(type(vocab)))
    print("len(vocab): {}\n".format(len(vocab)))
    print(vocab[:10])

    # titles for each story
    titles = lda.datasets.load_reuters_titles()
    print("type(titles): {}".format(type(titles)))
    print("len(titles): {}\n".format(len(titles)))
    pprint(titles[:10])

    print('LDA start ----')
    topic_num = 20
    model = lda.LDA(n_topics=topic_num, n_iter=800, random_state=1)
    model.fit(X)

    # topic-word
    topic_word = model.topic_word_
    print("type(topic_word): {}".format(type(topic_word)))
    print("shape: {}".format(topic_word.shape))
    print(vocab[:5])
    print(topic_word[:, :5])

    # Print Topic distribution
    n = 7
    for i, topic_dist in enumerate(topic_word):
        topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n + 1):-1]
        print('*Topic {}\n- {}'.format(i, ' '.join(topic_words)))

    # Document - topic
    doc_topic = model.doc_topic_
    print("type(doc_topic): {}".format(type(doc_topic)))
    print("shape: {}".format(doc_topic.shape))
    for i in range(10):
        topic_most_pr = doc_topic[i].argmax()
        print("文档: {} 主题: {} value: {}".format(i, topic_most_pr, doc_topic[i][topic_most_pr]))

    mpl.rcParams['font.sans-serif'] = ['SimHei']
    mpl.rcParams['axes.unicode_minus'] = False

    # Topic - word
    plt.figure(figsize=(7, 6))
    # f, ax = plt.subplots(5, 1, sharex=True)
    for i, k in enumerate([0, 5, 9, 14, 19]):
        ax = plt.subplot(5, 1, i+1)
        ax.plot(topic_word[k, :], 'r-')
        ax.set_xlim(-50, 4350)   # [0,4258]
        ax.set_ylim(0, 0.08)
        ax.set_ylabel("概率")
        ax.set_title("主题 {}".format(k))
    plt.xlabel("词", fontsize=13)
    plt.tight_layout()
    plt.suptitle('主题的词分布', fontsize=15)
    plt.subplots_adjust(top=0.9)
    plt.show()

    # Document - Topic
    plt.figure(figsize=(7, 6))
    # f, ax= plt.subplots(5, 1, figsize=(8, 6), sharex=True)
    for i, k in enumerate([1, 3, 4, 8, 9]):
        ax = plt.subplot(5, 1, i+1)
        ax.stem(doc_topic[k, :], linefmt='g-', markerfmt='ro')
        ax.set_xlim(-1, topic_num+1)
        ax.set_ylim(0, 1)
        ax.set_ylabel("概率")
        ax.set_title("文档 {}".format(k))
    plt.xlabel("主题", fontsize=13)
    plt.suptitle('文档的主题分布', fontsize=15)
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()

The results are shown in the figures below:

2019-02-19 15_46_40-Start.png

2019-02-19 15_47_03-Start.png