Naive Bayes Classification with sklearn

Theory

Bayes' Theorem

Bayes' theorem describes the relationship between conditional probabilities:
$$P(A|B) = \cfrac{P(B|A) \cdot P(A)}{P(B)}$$

The Naive Bayes Classifier

A naive Bayes classifier is a probability-based classifier. We make the following definitions:

  • B: the event that a sample has feature vector B
  • A: the event that a sample belongs to class A

With these definitions, the terms in Bayes' formula read as follows:

  • P(A|B): the probability that a sample with feature vector B belongs to class A (the quantity we want to compute)
  • P(B|A): the probability of observing feature vector B within class A (estimated from the training data)
  • P(A): the probability of class A occurring (its frequency in the training data)
  • P(B): the probability of feature vector B occurring (its frequency in the training data)

The naive Bayes classifier further assumes that the features are conditionally independent given the class, so the formula becomes $$P(A|B) = \cfrac{P(A)\prod_i P(B_{i}|A)}{P(B)}$$

Every quantity on the right-hand side can be estimated from the training data. At prediction time we compute this score for each class and pick the class with the highest value; since P(B) is identical for every class, it can be dropped from the comparison.
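The prediction rule above can be sketched from scratch on hypothetical toy data with binary features (a Bernoulli-style variant, with Laplace smoothing to avoid zero probabilities; all data and names here are illustrative, not from the article):

```python
import numpy as np

# Hypothetical toy training data: 2 classes, 3 binary features.
# Rows are samples, columns are the features B_i.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([0, 0, 1, 1])

def predict(x, X, y, alpha=1.0):
    """Pick the class maximizing P(A) * prod_i P(B_i|A).
    P(B) is omitted because it is the same for every class."""
    best_class, best_score = None, -np.inf
    for c in np.unique(y):
        Xc = X[y == c]
        prior = len(Xc) / len(X)                          # P(A)
        # Laplace-smoothed per-feature likelihoods P(B_i = 1 | A)
        p = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
        likelihood = np.prod(np.where(x == 1, p, 1 - p))  # prod_i P(B_i|A)
        score = prior * likelihood
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict(np.array([1, 1, 0]), X, y))  # → 0
```

In practice the product of many small probabilities underflows, which is why real implementations (including sklearn's) sum log-probabilities instead.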

Naive Bayes with Continuous Features

Continuous feature values can be handled in two ways:

  • Discretize the continuous values into intervals
  • Assume each feature follows a normal (or other) distribution, which is a strong prior assumption; estimate the distribution's parameters from the samples and substitute the probability density into Bayes' formula
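The second approach is what sklearn's GaussianNB implements. A minimal sketch on the Iris dataset (this dataset and split are my illustration, not part of the article's news-classification example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Iris features are continuous measurements, so GaussianNB estimates a
# per-class mean and variance for each feature and plugs the normal
# density into Bayes' formula in place of a discrete probability.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
print(gnb.score(X_test, y_test))
```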

Implementation

Loading the Data — Text News Corpus

# Alternative: download the corpus directly via sklearn
# from sklearn.datasets import fetch_20newsgroups
# news = fetch_20newsgroups(subset='all')
# print(len(news.data))
# print(news.data[0])
from sklearn import datasets
# load_files infers the category labels from the directory structure
train = datasets.load_files("./20newsbydate/20news-bydate-train")
test = datasets.load_files("./20newsbydate/20news-bydate-test")
print(train.DESCR)  # load_files sets no description, so this prints None
print(len(train.data))
print(train.data[0])
None
11314
b"From: cubbie@garnet.berkeley.edu (                               )\nSubject: Re: Cubs behind Marlins? How?\nArticle-I.D.: agate.1pt592$f9a\nOrganization: University of California, Berkeley\nLines: 12\nNNTP-Posting-Host: garnet.berkeley.edu\n\n\ngajarsky@pilot.njin.net writes:\n\nmorgan and guzman will have era's 1 run higher than last year, and\n the cubs will be idiots and not pitch harkey as much as hibbard.\n castillo won't be good (i think he's a stud pitcher)\n\n       This season so far, Morgan and Guzman helped to lead the Cubs\n       at top in ERA, even better than THE rotation at Atlanta.\n       Cubs ERA at 0.056 while Braves at 0.059. We know it is early\n       in the season, we Cubs fans have learned how to enjoy the\n       short triumph while it is still there.\n"

Preprocessing — Feature Extraction (Text Vectorization)

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(stop_words="english", decode_error='ignore')
# Fit the vocabulary on the training data only, then reuse it for the test set
train_vec = vec.fit_transform(train.data)
test_vec = vec.transform(test.data)
print(train_vec.shape)
(11314, 129782)
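As an aside, TfidfVectorizer is a common drop-in alternative to CountVectorizer that downweights words appearing in many documents, which often helps MultinomialNB on text. A small sketch on a hypothetical mini-corpus (the documents below are invented stand-ins for train.data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; each document becomes a row of TF-IDF weights
docs = ["the cubs beat the marlins",
        "the braves lead in era",
        "windows graphics drivers"]
tfidf = TfidfVectorizer(stop_words="english")
mat = tfidf.fit_transform(docs)
print(mat.shape)  # (n_documents, vocabulary_size)
```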

Training the Model

from sklearn.naive_bayes import MultinomialNB
bays = MultinomialNB()
bays.fit(train_vec,train.target)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Model Evaluation

Using the Built-in Scorer

bays.score(test_vec,test.target)
0.80244291024960168

Using sklearn's Evaluation Metrics

from sklearn.metrics import classification_report
y = bays.predict(test_vec)
print(classification_report(test.target,y,target_names=test.target_names))
                          precision    recall  f1-score   support

             alt.atheism       0.80      0.81      0.80       319
           comp.graphics       0.65      0.80      0.72       389
 comp.os.ms-windows.misc       0.80      0.04      0.08       394
comp.sys.ibm.pc.hardware       0.55      0.80      0.65       392
   comp.sys.mac.hardware       0.85      0.79      0.82       385
          comp.windows.x       0.69      0.84      0.76       395
            misc.forsale       0.89      0.74      0.81       390
               rec.autos       0.89      0.92      0.91       396
         rec.motorcycles       0.95      0.94      0.95       398
      rec.sport.baseball       0.95      0.92      0.93       397
        rec.sport.hockey       0.92      0.97      0.94       399
               sci.crypt       0.80      0.96      0.87       396
         sci.electronics       0.79      0.70      0.74       393
                 sci.med       0.88      0.87      0.87       396
               sci.space       0.84      0.92      0.88       394
  soc.religion.christian       0.81      0.95      0.87       398
      talk.politics.guns       0.72      0.93      0.81       364
   talk.politics.mideast       0.93      0.94      0.94       376
      talk.politics.misc       0.68      0.62      0.65       310
      talk.religion.misc       0.88      0.44      0.59       251

             avg / total       0.81      0.80      0.78      7532
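The report shows one striking weak spot: comp.os.ms-windows.misc has recall 0.04, meaning almost all of its test samples were assigned to other classes. A confusion matrix reveals where misclassified samples go; a minimal sketch on hypothetical labels (the real call would be confusion_matrix(test.target, y)):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for 3 classes: row i, column j counts samples of
# true class i that were predicted as class j, so off-diagonal entries
# show exactly where a low-recall class "leaks" to.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 1, 2]
print(confusion_matrix(y_true, y_pred))
```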
