# Theory

## Bayes' Theorem

$$P(A|B) = \cfrac{P(B|A) \cdot P(A)}{P(B)}$$

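## Naive Bayes Classifier

• B: the sample has feature vector B
• A: the sample belongs to class A

• P(A|B): the probability that a sample with feature vector B belongs to class A (the quantity to compute)
• P(B|A): the probability that feature vector B occurs within class A (estimated from the training samples)
• P(A): the probability of class A (its frequency in the training samples)
• P(B): the probability of feature vector B (its frequency in the training samples)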
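
As a minimal sketch of how these four quantities combine, the code below estimates P(A) and P(B|A) from counts on a small made-up dataset and normalizes to obtain P(A|B). It assumes the "naive" factorization, in which the features are conditionally independent given the class, so P(B|A) is the product of the per-feature probabilities P(b_i|A). The weather data and the function names are illustrative and not part of the original post, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[label][i][value] = times feature i took `value` in class `label`
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in zip(samples, labels):
        for i, value in enumerate(features):
            feature_counts[label][i][value] += 1
    return class_counts, feature_counts

def posterior(features, class_counts, feature_counts):
    """Return P(A|B) for every class A, given the feature vector B."""
    total = sum(class_counts.values())
    joint = {}
    for label, n_label in class_counts.items():
        p = n_label / total  # P(A)
        for i, value in enumerate(features):
            p *= feature_counts[label][i][value] / n_label  # P(b_i|A)
        joint[label] = p  # P(B|A) * P(A)
    evidence = sum(joint.values())  # P(B), summed over all classes
    return {label: p / evidence for label, p in joint.items()}

# Made-up training data: (weather, temperature) -> decision
samples = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
           ("rainy", "hot"), ("sunny", "hot"), ("rainy", "hot")]
labels = ["play", "play", "play", "stay", "stay", "stay"]

class_counts, feature_counts = train_naive_bayes(samples, labels)
result = posterior(("sunny", "hot"), class_counts, feature_counts)
print(result)
```

On this toy data the posterior comes out to roughly 0.4 for `play` and 0.6 for `stay`. A feature value never seen in a class would make that class's probability zero, which is why practical implementations add smoothing.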

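## Naive Bayes Classifier with Continuous Features

• Discretize the continuous values into intervals
• Assume the features follow a normal distribution or some other distribution (a strong prior assumption), estimate the distribution's parameters from the samples, and substitute the probability density into Bayes' formula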
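
Both options can be sketched in a few lines of plain Python. The cut points, the height values, and all function names below are made up for illustration; the Gaussian variant uses the maximum-likelihood mean and variance for a single feature, which is one reasonable choice rather than the only one.

```python
import math
from bisect import bisect_right

# Option 1: discretize a continuous value into an interval index.
bin_edges = [160.0, 170.0, 180.0]  # hypothetical cut points

def discretize(x):
    """Return the index (0..len(bin_edges)) of the interval containing x."""
    return bisect_right(bin_edges, x)

# Option 2: assume each class is Gaussian and estimate its parameters.
def fit_gaussian(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # ML estimate
    return mean, var

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

heights = {"a": [158.0, 162.0, 166.0], "b": [172.0, 176.0, 180.0]}  # made-up data
n_total = sum(len(vals) for vals in heights.values())
params = {label: fit_gaussian(vals) for label, vals in heights.items()}
priors = {label: len(vals) / n_total for label, vals in heights.items()}

def gaussian_posterior(x):
    """P(class | x), with the density standing in for P(B|A)."""
    joint = {label: priors[label] * gaussian_pdf(x, *params[label]) for label in params}
    total = sum(joint.values())
    return {label: p / total for label, p in joint.items()}

print(discretize(165.0))
print(gaussian_posterior(169.0))
```

Because the two made-up classes have equal priors and equal variances, a value halfway between their means (169.0) gets a posterior of 0.5 for each class.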

# Implementation

## Loading the Data: Newsgroup Text Data

```python
# from sklearn.datasets import fetch_20newsgroups
# news = fetch_20newsgroups(subset='all')
# print(len(news.data))
# print(news.data[0])
```
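```python
from sklearn import datasets
# Assumed load step (missing in the original); directory names are placeholders
# for a local copy of the 20news-bydate train/test split.
train = datasets.load_files("20news-bydate-train")
test = datasets.load_files("20news-bydate-test")
```
```python
print(train.DESCR)
print(len(train.data))
print(train.data[0])
```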
```
None
11314
b"From: cubbie@garnet.berkeley.edu (                               )\nSubject: Re: Cubs behind Marlins? How?\nArticle-I.D.: agate.1pt592$f9a\nOrganization: University of California, Berkeley\nLines: 12\nNNTP-Posting-Host: garnet.berkeley.edu\n\n\ngajarsky@pilot.njin.net writes:\n\nmorgan and guzman will have era's 1 run higher than last year, and\n the cubs will be idiots and not pitch harkey as much as hibbard.\n castillo won't be good (i think he's a stud pitcher)\n\n       This season so far, Morgan and Guzman helped to lead the Cubs\n       at top in ERA, even better than THE rotation at Atlanta.\n       Cubs ERA at 0.056 while Braves at 0.059. We know it is early\n       in the season, we Cubs fans have learned how to enjoy the\n       short triumph while it is still there.\n"
```

# Data Processing: Feature Extraction (Text Vectorization)

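```python
from sklearn.feature_extraction.text import CountVectorizer

# Drop English stop words; ignore bytes that cannot be decoded
vec = CountVectorizer(stop_words="english", decode_error='ignore')
train_vec = vec.fit_transform(train.data)  # learn the vocabulary on the training set
test_vec = vec.transform(test.data)        # reuse that vocabulary for the test set
print(train_vec.shape)
```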
```
(11314, 129782)
```
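
To make the vectorization step concrete, here is a simplified, standard-library sketch of the bag-of-words idea. It is not `CountVectorizer` itself: the tokenizer only approximates the default behavior (lowercasing, tokens of two or more word characters), the output is a dense list of lists rather than a sparse matrix, and the helper names, sentences, and one-word stop list are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(docs, stop_words=frozenset()):
    """Map every non-stop-word token seen in docs to a column index."""
    tokens = {t for doc in docs for t in tokenize(doc) if t not in stop_words}
    return {token: index for index, token in enumerate(sorted(tokens))}

def transform(docs, vocabulary):
    """Turn each document into a row of per-word counts."""
    rows = []
    for doc in docs:
        counts = Counter(t for t in tokenize(doc) if t in vocabulary)
        # Dicts keep insertion order, so columns follow the sorted vocabulary.
        rows.append([counts[token] for token in vocabulary])
    return rows

train_docs = ["The Cubs won the game", "The Braves lost the game"]
test_docs = ["Cubs fans enjoy the game"]

vocab = fit_vocabulary(train_docs, stop_words={"the"})
train_rows = transform(train_docs, vocab)
test_rows = transform(test_docs, vocab)  # words unseen in training are dropped

print(list(vocab))
print(train_rows)
print(test_rows)
```

The test document is encoded with the vocabulary learned from the training documents, so `fans` and `enjoy` contribute nothing. The same reasoning is why the test set above goes through `transform` rather than `fit_transform`.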

## Model Training

```python
from sklearn.naive_bayes import MultinomialNB

bays = MultinomialNB()  # defaults: alpha=1.0 (additive smoothing), fit_prior=True
bays.fit(train_vec, train.target)
```
```
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
```
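
The `alpha=1.0` in the output above is the additive smoothing parameter. The following is a rough standard-library sketch of what a multinomial naive Bayes fit and prediction amount to, assuming the usual smoothed estimate (count + alpha) / (class total + alpha × vocabulary size). It is not scikit-learn's implementation, and the function names and the tiny count matrix are made up for illustration.

```python
import math

def fit_multinomial_nb(rows, labels, alpha=1.0):
    """Estimate log P(class) and smoothed log P(word|class) from count rows."""
    classes = sorted(set(labels))
    n_features = len(rows[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_rows = [row for row, label in zip(rows, labels) if label == c]
        log_prior[c] = math.log(len(class_rows) / len(rows))
        totals = [sum(col) for col in zip(*class_rows)]  # per-word counts in class c
        denom = sum(totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_likelihood

def predict(row, log_prior, log_likelihood):
    """Pick the class with the highest log P(class) + sum of count * log P(word|class)."""
    scores = {
        c: log_prior[c] + sum(x * lp for x, lp in zip(row, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Made-up word-count rows over a three-word vocabulary
rows = [[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]]
labels = ["sports", "sports", "tech", "tech"]

log_prior, log_likelihood = fit_multinomial_nb(rows, labels)
print(predict([1, 0, 0], log_prior, log_likelihood))
print(math.exp(log_likelihood["sports"][1]))  # nonzero despite a zero count
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a word that never appears in a class from forcing that class's probability to zero.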

# Model Evaluation

## Using the Built-in Scorer

```python
# For classifiers, score() returns the mean accuracy on the given data
bays.score(test_vec, test.target)
```
```
0.80244291024960168
```
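
For a classifier, this score is the mean accuracy: the fraction of test samples whose predicted label matches the true label. A minimal sketch of that computation, with a hypothetical helper name and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [0, 1, 2, 2]  # made-up true labels
y_pred = [0, 1, 1, 2]  # made-up predictions
acc = accuracy(y_true, y_pred)
print(acc)
```

Accuracy alone hides per-class behavior, which is what the per-class report in the next section exposes.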

## Using a Metrics Function

```python
from sklearn.metrics import classification_report

y = bays.predict(test_vec)  # predicted class index for each test document
print(classification_report(test.target, y, target_names=test.target_names))
```
```
                          precision    recall  f1-score   support

              alt.atheism       0.80      0.81      0.80       319
            comp.graphics       0.65      0.80      0.72       389
  comp.os.ms-windows.misc       0.80      0.04      0.08       394
 comp.sys.ibm.pc.hardware       0.55      0.80      0.65       392
    comp.sys.mac.hardware       0.85      0.79      0.82       385
           comp.windows.x       0.69      0.84      0.76       395
             misc.forsale       0.89      0.74      0.81       390
                rec.autos       0.89      0.92      0.91       396
          rec.motorcycles       0.95      0.94      0.95       398
       rec.sport.baseball       0.95      0.92      0.93       397
         rec.sport.hockey       0.92      0.97      0.94       399
                sci.crypt       0.80      0.96      0.87       396
          sci.electronics       0.79      0.70      0.74       393
                  sci.med       0.88      0.87      0.87       396
                sci.space       0.84      0.92      0.88       394
   soc.religion.christian       0.81      0.95      0.87       398
       talk.politics.guns       0.72      0.93      0.81       364
    talk.politics.mideast       0.93      0.94      0.94       376
       talk.politics.misc       0.68      0.62      0.65       310
       talk.religion.misc       0.88      0.44      0.59       251

              avg / total       0.81      0.80      0.78      7532
```
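
To read the report columns: precision is the share of predictions for a class that are correct, recall is the share of that class's true samples that were found, F1 is their harmonic mean, and support is the number of true samples of the class. The sketch below computes the first three for one class from label lists; the helper names and the toy labels are made up, and the last line simply rechecks the `comp.os.ms-windows.misc` row from the rounded precision and recall reported above.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def precision_recall_f1(y_true, y_pred, label):
    """Per-class metrics, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1(precision, recall)

y_true = ["a", "a", "a", "b", "b"]  # made-up true labels
y_pred = ["a", "b", "b", "b", "b"]  # made-up predictions

metrics_a = precision_recall_f1(y_true, y_pred, "a")
metrics_b = precision_recall_f1(y_true, y_pred, "b")
print(metrics_a, metrics_b)

# Recheck one reported row: precision 0.80 and recall 0.04
windows_f1 = f1(0.80, 0.04)
print(round(windows_f1, 2))
```

The recheck shows why that row's F1 is so low: a recall of 0.04 drags the harmonic mean down to about 0.08 even though precision is 0.80.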