03 贝叶斯算法 - 案例二 - 新闻数据分类

01 贝叶斯算法 - 朴素贝叶斯
 02 贝叶斯算法 - 案例一 - 鸢尾花数据分类

常规操作

import numpy as np
from time import time
import matplotlib.pyplot as plt
import matplotlib as mpl

from sklearn.datasets import fetch_20newsgroups#引入新闻数据包
from sklearn.feature_extraction.text import TfidfVectorizer#做tfidf编码
from sklearn.feature_selection import SelectKBest, chi2#卡方检验——特征筛选
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import LinearSVC,SVC
from sklearn.naive_bayes import MultinomialNB, BernoulliNB #引入多项式和伯努利的贝叶斯
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

## 设置属性防止中文乱码
mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus'] = False

基准模型方法

def benchmark(clf,name):
    print (u'分类器：', clf)
    
    ##  设置最优参数，并使用5折交叉验证获取最优参数值
    alpha_can = np.logspace(-2, 1, 10)
    model = GridSearchCV(clf, param_grid={'alpha': alpha_can}, cv=5)
    m = alpha_can.size
    
    ## 如果模型有一个参数是alpha，进行设置
    if hasattr(clf, 'alpha'):
        model.set_params(param_grid={'alpha': alpha_can})
        m = alpha_can.size
    ## 如果模型有一个k近邻的参数，进行设置
    if hasattr(clf, 'n_neighbors'):
        neighbors_can = np.arange(1, 15)
        model.set_params(param_grid={'n_neighbors': neighbors_can})
        m = neighbors_can.size
    ## LinearSVC最优参数配置
    if hasattr(clf, 'C'):
        C_can = np.logspace(1, 3, 3)
        model.set_params(param_grid={'C':C_can})
        m = C_can.size
    ## SVM最优参数设置
    if hasattr(clf, 'C') & hasattr(clf, 'gamma'):
        C_can = np.logspace(1, 3, 3)
        gamma_can = np.logspace(-3, 0, 3)
        model.set_params(param_grid={'C':C_can, 'gamma':gamma_can})
        m = C_can.size * gamma_can.size
    ## 设置深度相关参数，决策树
    if hasattr(clf, 'max_depth'):
        max_depth_can = np.arange(4, 10)
        model.set_params(param_grid={'max_depth': max_depth_can})
        m = max_depth_can.size
    
    ## 模型训练
    t_start = time()
    model.fit(x_train, y_train)
    t_end = time()
    t_train = (t_end - t_start) / (5*m)
    print (u'5折交叉验证的训练时间为：%.3f秒/(5*%d)=%.3f秒' % ((t_end - t_start), m, t_train))
    print (u'最优超参数为：', model.best_params_)
    
    ## 模型预测
    t_start = time()
    y_hat = model.predict(x_test)
    t_end = time()
    t_test = t_end - t_start
    print (u'测试时间：%.3f秒' % t_test)
    
    ## 模型效果评估
    train_acc = metrics.accuracy_score(y_train, model.predict(x_train))
    test_acc = metrics.accuracy_score(y_test, y_hat)
    print (u'训练集准确率：%.2f%%' % (100 * train_acc))
    print (u'测试集准确率：%.2f%%' % (100 * test_acc))
    
    ## 返回结果(训练时间耗时，预测数据耗时，训练数据错误率，测试数据错误率, 名称)
    return t_train, t_test, 1-train_acc, 1-test_acc, name

数据加载

print (u'加载数据...')
t_start = time()
## 不要头部信息
remove = ('headers', 'footers', 'quotes')
## 只要这四类数据
categories = 'alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space'

## 分别加载训练数据和测试数据
data_train = fetch_20newsgroups(data_home='./datas/',subset='train', categories=categories, shuffle=True, random_state=0, remove=remove)
data_test = fetch_20newsgroups(data_home='./datas/',subset='test', categories=categories, shuffle=True, random_state=0, remove=remove)

## 完成
print (u"完成数据加载过程.耗时:%.3fs" % (time() - t_start))

加载数据...
完成数据加载过程.耗时:4.729s

len(data_train['data'])

2034

print(data_train.target_names)

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

获取加载数据的相关信息

def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

categories = data_train.target_names
data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print (u'数据类型：', type(data_train.data))
print("%d文本数量 - %0.3fMB (训练数据集)" % (len(data_train.data), data_train_size_mb))
print("%d文本数量 - %0.3fMB (测试数据集)" % (len(data_test.data), data_test_size_mb))
print (u'训练集和测试集使用的%d个类别的名称：' % len(categories))
print(categories)

数据类型： <class 'list'>
2034文本数量 - 2.428MB (训练数据集)
1353文本数量 - 1.800MB (测试数据集)
训练集和测试集使用的4个类别的名称：
['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

数据重命名

x_train = data_train.data
y_train = data_train.target
x_test = data_test.data
y_test = data_test.target

输出前5个样本

print (u' -- 前5个文本 -- ')
for i in range(5):
    print (u'文本%d(属于类别 - %s)：' % (i+1, categories[y_train[i]]))
    print (x_train[i])
    print ('\n\n')

-- 前5个文本 --
文本1(属于类别 - alt.atheism)：
If one is a vegan (a vegetarian taht eats no animal products at at i.e eggs,
milk, cheese, etc., after about 3 years of a vegan diet, you need to start
taking B12 supplements because b12 is found only in animals.) Acutally our
bodies make B12, I think, but our bodies use up our own B12 after 2 or 3
years.
Lacto-oveo vegetarians, like myself, still get B12 through milk products
and eggs, so we don't need supplements.
And If anyone knows more, PLEASE post it. I'm nearly contridicting myself
with the mish-mash of knowledge I've gleaned.

文本2(属于类别 - comp.graphics)：
Hi,
I have a friend who is working on 2-d and 3-d object recognition. He is looking
for references describing algorithms on the following subject areas:

Thresholding
Edge Segmentation
Marr-Hildreth
Sobel Operator
Chain Codes
Thinning - Skeletonising

If anybody is willing to post an algorithm that they have implemented which demonstrates
any of the above topics, it would be much appreciated.

Please post all replies to my e-mail address. If requested I will post a summary to the
newsgroup in a couple of weeks.

Thanks in advance for all replies

文本3(属于类别 - comp.graphics)：
Hello netters

Sorry, I don't know if this is the right way of doing this kind of thing,
probably should be a CFV, but since I don't have tha ability to create a
news group myself, I just want to start the discussion.

I enjoy reading c.g very much, but I often find it difficult to sort out what
I'm interested in. Everything from screen-drivers, graphics cards, graphics
programming and graphics programs are discused here. What I'd like is a
comp.graphics.programmer news group.
What do you other think.

文本4(属于类别 - comp.graphics)：

Yes, I did punch in the wrong numbers (working too many late nites). I
intended on stating 640x400 is 256,000 bytes. It's not in the bios, just my
VESA TSR.

文本5(属于类别 - talk.religion.misc)：

Well, I am not Andy, but if you had familiarized yourself with some of
the current theories/hypotheses about abiogenesis before posting :-), you
would be aware of the fact that none of them claims that proteins were
assembled randomly from amino acids. It is current thinking that RNA-
based replicators came before proteinaceous enzymes, and that proteins
were assembled by some kind of primitive translation machinery.

Now respond to 2. :-)
--Cornelius.

文档转换为向量

转换
要求输入的数据类型必须是list，list中的每条数据，单词是以空格分割开的

vectorizer = TfidfVectorizer(input='content', stop_words='english', max_df=0.5, sublinear_tf=True)
x_train = vectorizer.fit_transform(data_train.data)  # x_train是稀疏的，scipy.sparse.csr.csr_matrix
x_test = vectorizer.transform(data_test.data)
print (u'训练集样本个数：%d，特征个数：%d' % x_train.shape)
print (u'停止词:\n')
print(vectorizer.get_stop_words())
## 获取最终的特征属性名称
feature_names = np.asarray(vectorizer.get_feature_names())

训练集样本个数：2034，特征个数：26576
停止词:
frozenset({'back', 'much', 'your', 'a', 'above', 'almost', 'those', 'indeed', 'between', 'last', 'mill', 'some', 'therein', 'they', 'who', 'could', 'too', 'nor', 'even', 'against', 'am', 'herein', 'except', 'further', 'hereupon', 'may', 'move', 'myself', 'on', 'sixty', 'empty', 'bottom', 'cry', 'eg', 'amoungst', 'find', 'take', 'otherwise', 'before', 'six', 'de', 'any', 'ltd', 'per', 'we', 'both', 'and', 'my', 'amount', 'formerly', 'over', 'seems', 'therefore', 'couldnt', 'own', 'whereby', 'beyond', 'four', 'ie', 'many', 'together', 'whole', 'whom', 'however', 'beforehand', 'sometimes', 'only', 'elsewhere', 'cant', 'rather', 'twenty', 'when', 'until', 'more', 'to', 'full', 'former', 'still', 'two', 're', 'whenever', 'yourself', 'five', 'meanwhile', 'found', 'whether', 'himself', 'three', 'none', 'how', 'onto', 'should', 'are', 'afterwards', 'another', 'since', 'them', 'enough', 'thus', 'ourselves', 'after', 'becoming', 'themselves', 'wherein', 'noone', 'me', 'as', 'fire', 'become', 'moreover', 'fill', 'for', 'no', 'he', 'etc', 'nobody', 'others', 'thereafter', 'though', 'behind', 'anywhere', 'by', 'once', 'along', 'so', 'very', 'whereupon', 'whose', 'the', 'everything', 'perhaps', 'made', 'alone', 'side', 'fifty', 'then', 'up', 'where', 'down', 'during', 'next', 'thick', 'whereas', 'itself', 'these', 'it', 'already', 'every', 'is', 'because', 'been', 'in', 'someone', 'call', 'latterly', 'un', 'can', 'besides', 'into', 'hundred', 'same', 'whatever', 'serious', 'seem', 'his', 'upon', 'forty', 'without', 'eight', 'due', 'our', 'thin', 'becomes', 'such', 'amongst', 'several', 'yet', 'hasnt', 'here', 'hence', 'most', 'an', 'across', 'least', 'interest', 'from', 'wherever', 'thereupon', 'neither', 'twelve', 'each', 'might', 'thru', 'was', 'nowhere', 'their', 'were', 'whither', 'either', 'less', 'within', 'but', 'eleven', 'yourselves', 'than', 'anything', 'of', 'sometime', 'again', 'top', 'always', 'other', 'towards', 'while', 'around', 'would', 'herself', 'if', 'nothing', 'detail', 'also', 'i', 'out', 'her', 'hers', 'ours', 'front', 'one', 'you', 'now', 'us', 'became', 'namely', 'thereby', 'about', 'done', 'somehow', 'throughout', 'thence', 'with', 'toward', 'why', 'its', 'anyway', 'somewhere', 'show', 'mostly', 'latter', 'him', 'get', 'or', 'being', 'third', 'has', 'all', 'first', 'under', 'co', 'had', 'nevertheless', 'go', 'off', 'else', 'part', 'put', 'what', 'whereafter', 'at', 'yours', 'be', 'must', 'keep', 'will', 'among', 'seemed', 'give', 'everyone', 'something', 'nine', 'everywhere', 'anyhow', 'hereby', 'via', 'do', 'have', 'few', 'seeming', 'beside', 'inc', 'hereafter', 'not', 'sincere', 'system', 'that', 'bill', 'con', 'describe', 'please', 'through', 'see', 'ten', 'this', 'fifteen', 'although', 'never', 'whence', 'whoever', 'mine', 'well', 'often', 'anyone', 'which', 'she', 'name', 'cannot', 'below', 'there', 'ever'})

特征选择

ch2 = SelectKBest(chi2, k=1000)
x_train = ch2.fit_transform(x_train, y_train)
x_test = ch2.transform(x_test)
feature_names = [feature_names[i] for i in ch2.get_support(indices=True)]

使用不同的分类器对数据进行比较

print (u'分类器的比较：\n')
clfs = [
    [RidgeClassifier(), 'Ridge'],
    [KNeighborsClassifier(), 'KNN'],
    [MultinomialNB(), 'MultinomialNB'],
    [BernoulliNB(), 'BernoulliNB'],
    [RandomForestClassifier(n_estimators=200), 'RandomForest'],
    [SVC(), 'SVM'],
    [LinearSVC(loss='squared_hinge', penalty='l1', dual=False, tol=1e-4), 'LinearSVC-l1'],
    [LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-4), 'LinearSVC-l2']
]

## 将训练数据保存到一个列表中

result = []
for clf,name in clfs:
    # 计算算法结果
    a = benchmark(clf,name)
    # 追加到一个列表中，方便进行展示操作
    result.append(a)
    print ('\n')
## 将列表转换为数组
result = np.array(result)

分类器的比较：

分类器： RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
max_iter=None, normalize=False, random_state=None, solver='auto',
tol=0.001)
5折交叉验证的训练时间为：14.476秒/(5*10)=0.290秒
最优超参数为： {'alpha': 0.46415888336127775}
测试时间：0.002秒
训练集准确率：92.63%
测试集准确率：75.83%

分类器： KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')
5折交叉验证的训练时间为：9.060秒/(5*14)=0.129秒
最优超参数为： {'n_neighbors': 1}
测试时间：0.105秒
训练集准确率：96.51%
测试集准确率：50.55%

分类器： MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
5折交叉验证的训练时间为：0.328秒/(5*10)=0.007秒
最优超参数为： {'alpha': 0.01}
测试时间：0.001秒
训练集准确率：91.40%
测试集准确率：76.72%

分类器： BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
5折交叉验证的训练时间为：0.431秒/(5*10)=0.009秒
最优超参数为： {'alpha': 0.01}
测试时间：0.001秒
训练集准确率：88.64%
测试集准确率：74.28%

分类器： RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
5折交叉验证的训练时间为：25.608秒/(5*6)=0.854秒
最优超参数为： {'max_depth': 9}
测试时间：0.088秒
训练集准确率：74.93%
测试集准确率：66.89%

分类器： SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
5折交叉验证的训练时间为：38.628秒/(5*9)=0.858秒
最优超参数为： {'C': 100.0, 'gamma': 0.03162277660168379}
测试时间：0.239秒
训练集准确率：93.36%
测试集准确率：73.54%

分类器： LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l1', random_state=None, tol=0.0001,
verbose=0)
5折交叉验证的训练时间为：11.595秒/(5*3)=0.773秒
最优超参数为： {'C': 10.0}
测试时间：0.002秒
训练集准确率：96.31%
测试集准确率：72.21%

分类器： LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
verbose=0)
5折交叉验证的训练时间为：1.346秒/(5*3)=0.090秒
最优超参数为： {'C': 10.0}
测试时间：0.001秒
训练集准确率：95.48%
测试集准确率：74.58%

获取需要画图的数据

result = [[x[i] for x in result] for i in range(5)]
training_time, test_time, training_err, test_err, clf_names = result

training_time = np.array(training_time).astype(np.float)
test_time = np.array(test_time).astype(np.float)
training_err = np.array(training_err).astype(np.float)
test_err = np.array(test_err).astype(np.float)

画图

x = np.arange(len(training_time))
plt.figure(figsize=(10, 7), facecolor='w')
ax = plt.axes()
b0 = ax.bar(x+0.1, training_err, width=0.2, color='#77E0A0')
b1 = ax.bar(x+0.3, test_err, width=0.2, color='#8800FF')
ax2 = ax.twinx()
b2 = ax2.bar(x+0.5, training_time, width=0.2, color='#FFA0A0')
b3 = ax2.bar(x+0.7, test_time, width=0.2, color='#FF8080')
plt.xticks(x+0.5, clf_names)
plt.legend([b0[0], b1[0], b2[0], b3[0]], (u'训练集错误率', u'测试集错误率', u'训练时间', u'测试时间'), loc='upper left', shadow=True)
plt.title(u'新闻组文本数据分类及不同分类器效果比较', fontsize=18)
plt.xlabel(u'分类器名称')
plt.grid(True)
plt.tight_layout(2)
plt.show()

04 贝叶斯算法 - 贝叶斯网络

最后编辑于：2018.12.12 19:54:19

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 156,423评论 4赞 359
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 66,339评论 1赞 289
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 106,241评论 0赞 237
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 43,503评论 0赞 203
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 51,824评论 3赞 285
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 40,262评论 1赞 207
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 31,615评论 2赞 309
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 30,337评论 0赞 194
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 33,989评论 1赞 238
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 30,300评论 2赞 240
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 31,829评论 1赞 256
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 28,193评论 2赞 250
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 32,753评论 3赞 230
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 25,970评论 0赞 8
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 26,708评论 0赞 192
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 35,295评论 2赞 267
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 35,207评论 2赞 258