[TODO] [scikit-learn translation] 4.2.3. Text feature extraction

4.2.3. Text feature extraction

4.2.3.1. The Bag of Words representation

Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.

In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:

tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
counting the occurrences of tokens in each document.
normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.

In this scheme, features and samples are defined as follows:

each individual token occurrence frequency (normalized or not) is treated as a feature.
the vector of all the token frequencies for a given document is considered a multivariate sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
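The three steps described above (tokenizing, counting, building one column per token) can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's implementation; the names `tokenize` and `bag_of_words` and the two toy documents are assumptions made for this example.

```python
import re
from collections import Counter

# Lowercase the text, then keep runs of two or more word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    """Split one document into lowercase word tokens."""
    return TOKEN_RE.findall(doc.lower())


def bag_of_words(docs):
    """Return (vocabulary, count_matrix) for a list of strings."""
    tokenized = [tokenize(doc) for doc in docs]
    # One column per distinct token, in sorted order.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    column = {tok: j for j, tok in enumerate(vocabulary)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for tok, n in Counter(toks).items():
            row[column[tok]] = n
        matrix.append(row)
    return vocabulary, matrix


toy_docs = ["the cat sat", "the cat saw the dog"]
vocab, X_toy = bag_of_words(toy_docs)
print(vocab)   # ['cat', 'dog', 'sat', 'saw', 'the']
print(X_toy)   # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each row is one document and each column one token; word order inside a document plays no role in the result.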

4.2.3.2. Sparsity

As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).

For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

In order to store such a matrix in memory and also to speed up matrix / vector algebraic operations, implementations will typically use a sparse representation such as those available in the scipy.sparse package.
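To make the idea of a sparse representation concrete, here is a small pure-Python sketch of a compressed-sparse-row layout (only the non-zero values, their column indices, and row boundaries are stored), followed by the density arithmetic for the e-mail example above. The helper `to_csr` and the tiny `dense` matrix are hypothetical; real code would normally rely on scipy.sparse instead.

```python
def to_csr(dense_rows):
    """Convert a list of dense rows to CSR-style (data, indices, indptr) lists."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)     # the non-zero value itself
                indices.append(j)      # its column index
        indptr.append(len(data))       # where the next row starts in `data`
    return data, indices, indptr


dense = [[0, 1, 1, 0],
         [0, 0, 0, 2],
         [3, 0, 0, 0]]
data, indices, indptr = to_csr(dense)
print(data, indices, indptr)  # [1, 1, 2, 3] [1, 2, 3, 0] [0, 2, 3, 4]

# Upper bound on the density in the e-mail example:
# 10,000 documents, 100,000 columns, at most 1,000 distinct words per document.
n_docs, vocab_size, max_unique_per_doc = 10_000, 100_000, 1_000
density = (n_docs * max_unique_per_doc) / (n_docs * vocab_size)
print(density)  # 0.01
```

Even in the worst case of the example at most 1% of the entries are non-zero, which is why storing only the non-zero entries pays off.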

4.2.3.3. Common Vectorizer usage

CountVectorizer implements both tokenization and occurrence counting in a single class:

from sklearn.feature_extraction.text import CountVectorizer

This model has many parameters, however the default values are quite reasonable (please see the reference documentation for the details):

>>> vectorizer = CountVectorizer()
>>> vectorizer                     
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
        dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Let’s use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:

>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X                              
<4x9 sparse matrix of type '<... 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse ... format>

The default configuration tokenizes the string by extracting words of at least 2 letters. The specific function that does this step can be requested explicitly:

>>> analyze = vectorizer.build_analyzer()
>>> analyze("This is a text document to analyze.") == (
...     ['this', 'is', 'text', 'document', 'to', 'analyze'])
True

Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:

>>> vectorizer.get_feature_names() == (
...     ['and', 'document', 'first', 'is', 'one',
...      'second', 'the', 'third', 'this'])
True

>>> X.toarray()           
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)

The converse mapping from feature name to column index is stored in the vocabulary_ attribute of the vectorizer:

>>> vectorizer.vocabulary_.get('document')
1

Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform method:

>>> vectorizer.transform(['Something completely new.']).toarray()
...                           
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]]...)
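The behaviour above follows directly from counting against a fixed token-to-column mapping. A rough pure-Python sketch, assuming the vocabulary learned from the toy corpus (the function `transform_with_vocabulary` is a hypothetical stand-in, not the library's code):

```python
import re


def transform_with_vocabulary(docs, vocabulary):
    """Count tokens using a fixed {token: column} mapping; unknown tokens are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in re.findall(r"\b\w\w+\b", doc.lower()):
            col = vocabulary.get(tok)
            if col is not None:        # tokens never seen during fit have no column
                row[col] += 1
        rows.append(row)
    return rows


# Mapping taken from the feature names listed earlier in this section.
learned = {"and": 0, "document": 1, "first": 2, "is": 3, "one": 4,
           "second": 5, "the": 6, "third": 7, "this": 8}

print(transform_with_vocabulary(["Something completely new."], learned))
# [[0, 0, 0, 0, 0, 0, 0, 0, 0]]
print(transform_with_vocabulary(["This is the second second document."], learned))
# [[0, 1, 0, 1, 0, 2, 1, 0, 1]]
```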

Note that in the previous corpus, the first and the last documents have exactly the same words hence are encoded in equal vectors. In particular we lose the information that the last document is an interrogative form. To preserve some of the local ordering information we can extract 2-grams of words in addition to the 1-grams (individual words):

>>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
...                                     token_pattern=r'\b\w+\b', min_df=1)
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!') == (
...     ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])
True

The vocabulary extracted by this vectorizer is hence much bigger and can now resolve ambiguities encoded in local positioning patterns:

>>> X_2 = bigram_vectorizer.fit_transform(corpus).toarray()
>>> X_2
...                           
array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]]...)

In particular the interrogative form “Is this” is only present in the last document:

>>> feature_index = bigram_vectorizer.vocabulary_.get('is this')
>>> X_2[:, feature_index]     
array([0, 0, 0, 1]...)
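How word n-grams recover some ordering information can be sketched without the library. The function `word_ngrams` below is an illustrative assumption; it uses the same `\b\w+\b` token pattern as the bigram vectorizer above.

```python
import re


def word_ngrams(text, min_n=1, max_n=2):
    """Return all word n-grams of `text` for n in [min_n, max_n]."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("Bi-grams are cool!"))
# ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']

first = set(word_ngrams("This is the first document."))
last = set(word_ngrams("Is this the first document?"))
print(sorted(last - first))   # ['is this', 'this the']
```

The two documents share every unigram, but the bigrams `'is this'` and `'this the'` occur only in the question, so the two rows are no longer identical.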

4.2.3.4. Tf–idf term weighting

In a large text corpus, some words will be very frequent (e.g. “the”, “a”, “is” in English) and hence carry very little meaningful information about the actual contents of the document. If we were to feed the raw count data directly to a classifier, those very frequent terms would overshadow the frequencies of rarer yet more interesting terms.

In order to re-weight the count features into floating point values suitable for usage by a classifier it is very common to use the tf–idf transform.

Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency: $\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t)$

Using the TfidfTransformer’s default settings, TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False), the term frequency, the number of times a term occurs in a given document, is multiplied with the idf component, which is computed as

$\text{idf}(t) = \log \frac{1 + n_d}{1+\text{df}(d,t)} + 1$,

where $n_d$ is the total number of documents and $\text{df}(d,t)$ is the number of documents that contain term $t$. The resulting tf-idf vectors are then normalized by the Euclidean norm:

$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$.

This was originally a term weighting scheme developed for information retrieval (as a ranking function for search engine results) that has also found good use in document classification and clustering.

The following sections contain further explanations and examples that illustrate how the tf-idfs are computed exactly and how the tf-idfs computed in scikit-learn’s TfidfTransformer and TfidfVectorizer differ slightly from the standard textbook notation that defines the idf as

$\text{idf}(t) = \log \frac{n_d}{1+\text{df}(d,t)}.$

In the TfidfTransformer and TfidfVectorizer with smooth_idf=False, the “1” count is added to the idf instead of the idf’s denominator:

$\text{idf}(t) = \log \frac{n_d}{\text{df}(d,t)} + 1$

This normalization is implemented by the TfidfTransformer class:

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> transformer   
TfidfTransformer(norm=...'l2', smooth_idf=False, sublinear_tf=False,
                 use_idf=True)

Again please see the reference documentation for the details on all the parameters.

Let’s take an example with the following counts. The first term is present 100% of the time, hence not very interesting. The other two features are present in less than 50% of the documents, hence probably more representative of the content of the documents:

>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf                         
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>

>>> tfidf.toarray()                        
array([[ 0.81940995,  0.        ,  0.57320793],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.47330339,  0.88089948,  0.        ],
       [ 0.58149261,  0.        ,  0.81355169]])

Each row is normalized to have unit Euclidean norm:

$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$

For example, we can compute the tf-idf of the first term in the first document in the counts array as follows (log denotes the natural logarithm):

$n_d = 6$

$\text{df}(d, t)_{\text{term1}} = 6$

$\text{idf}(t)_{\text{term1}} = \log \frac{n_d}{\text{df}(d, t)} + 1 = \log(1) + 1 = 1$

$\text{tf-idf}_{\text{term1}} = \text{tf} \times \text{idf} = 3 \times 1 = 3$

Now, if we repeat this computation for the remaining 2 terms in the document, we get

$\text{tf-idf}_{\text{term2}} = 0 \times (\log(6/1)+1) = 0$

$\text{tf-idf}_{\text{term3}} = 1 \times (\log(6/2)+1) \approx 2.0986$

and the vector of raw tf-idfs:

$\text{tf-idf}_{\text{raw}} = [3, 0, 2.0986].$

Then, applying the Euclidean (L2) norm, we obtain the following tf-idfs for document 1:

$\frac{[3, 0, 2.0986]}{\sqrt{3^2 + 0^2 + 2.0986^2}} = [0.819, 0, 0.573].$

Furthermore, the default parameter smooth_idf=True adds “1” to the numerator and denominator as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions:

$\text{idf}(t) = \log \frac{1 + n_d}{1+\text{df}(d,t)} + 1$

Using this modification, the tf-idf of the third term in document 1 changes to 1.8473:

$\text{tf-idf}_{\text{term3}} = 1 \times (\log(7/3)+1) \approx 1.8473$

And the L2-normalized tf-idf changes to

$\frac{[3, 0, 1.8473]}{\sqrt{3^2 + 0^2 + 1.8473^2}} = [0.8515, 0, 0.5243]$


>>> transformer = TfidfTransformer()
>>> transformer.fit_transform(counts).toarray()
array([[ 0.85151335,  0.        ,  0.52433293],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.55422893,  0.83236428,  0.        ],
       [ 0.63035731,  0.        ,  0.77630514]])

The weights of each feature computed by the fit method call are stored in a model attribute:

>>> transformer.idf_                       
array([ 1. ...,  2.25...,  1.84...])
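The numbers in this section can be checked by hand with a few lines of plain Python. The sketch below follows the two idf formulas given above and the L2 normalization; the function name `tfidf` is an assumption for this example, and it presumes every term occurs in at least one document (otherwise the unsmoothed variant would divide by zero).

```python
import math


def tfidf(count_rows, smooth_idf=True):
    """Return (idf, rows) where rows are L2-normalized tf * idf vectors."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    if smooth_idf:
        idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    else:
        idf = [math.log(n_docs / d) + 1 for d in df]
    rows = []
    for row in count_rows:
        raw = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return idf, rows


counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]

idf_smooth, rows_smooth = tfidf(counts, smooth_idf=True)
idf_plain, rows_plain = tfidf(counts, smooth_idf=False)

print([round(w, 4) for w in idf_smooth])      # [1.0, 2.2528, 1.8473]
print([round(v, 4) for v in rows_smooth[0]])  # [0.8515, 0.0, 0.5243]
print([round(v, 4) for v in rows_plain[0]])   # [0.8194, 0.0, 0.5732]
```

The first printed line corresponds to the `idf_` attribute shown above, and the other two match the first rows of the smoothed and unsmoothed outputs.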

As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer()
>>> vectorizer.fit_transform(corpus)
...                                
<4x9 sparse matrix of type '<... 'numpy.float64'>'
    with 19 stored elements in Compressed Sparse ... format>

While the tf–idf normalization is often very useful, there might be cases where the binary occurrence markers might offer better features. This can be achieved by using the binary parameter of CountVectorizer. In particular, some estimators such as Bernoulli Naive Bayes explicitly model discrete boolean random variables. Also, very short texts are likely to have noisy tf–idf values while the binary occurrence info is more stable.
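As a small illustration of what binary occurrence markers mean, the following sketch turns a count row into presence/absence features. The helper `binarize` is hypothetical; with scikit-learn itself the binary parameter mentioned above is the way to request this.

```python
def binarize(count_rows):
    """Replace every non-zero count by 1 (presence / absence features)."""
    return [[1 if value > 0 else 0 for value in row] for row in count_rows]


# Second document of the toy corpus: 'second' occurs twice.
count_rows = [[0, 1, 0, 1, 0, 2, 1, 0, 1]]
print(binarize(count_rows))  # [[0, 1, 0, 1, 0, 1, 1, 0, 1]]
```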

As usual the best way to adjust the feature extraction parameters is to use a cross-validated grid search, for instance by pipelining the feature extractor with a classifier:

Sample pipeline for text feature extraction and evaluation


4.2.3.5. Decoding text files

Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding. To work with text files in Python, their bytes must be decoded to a character set called Unicode. Common encodings are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many others exist.

Note

An encoding can also be called a ‘character set’, but this term is less accurate: several encodings can exist for a single character set.

The text feature extractors in scikit-learn know how to decode text files, but only if you tell them what encoding the files are in. The CountVectorizer takes an encoding parameter for this purpose. For modern text files, the correct encoding is probably UTF-8, which is therefore the default (encoding="utf-8").

If the text you are loading is not actually encoded with UTF-8, however, you will get a UnicodeDecodeError. The vectorizers can be told to be silent about decoding errors by setting the decode_error parameter to either "ignore" or "replace". See the documentation for the Python function bytes.decode for more details (type help(bytes.decode) at the Python prompt).

If you are having trouble decoding text, here are some things to try:

Find out what the actual encoding of the text is. The file might come with a header or README that tells you the encoding, or there might be some standard encoding you can assume based on where the text comes from.
You may be able to find out what kind of encoding it is in general using the UNIX command file. The Python chardet module comes with a script called chardetect.py that will guess the specific encoding, though you cannot rely on its guess being correct.
You could try UTF-8 and disregard the errors. You can decode byte strings with bytes.decode(errors='replace') to replace all decoding errors with a meaningless character, or set decode_error='replace' in the vectorizer. This may damage the usefulness of your features.
Real text may come from a variety of sources that may have used different encodings, or even be sloppily decoded in a different encoding than the one it was encoded with. This is common in text retrieved from the Web. The Python package ftfy can automatically sort out some classes of decoding errors, so you could try decoding the unknown text as latin-1 and then using ftfy to fix errors.
If the text is in a mish-mash of encodings that is simply too hard to sort out (which is the case for the 20 Newsgroups dataset), you can fall back on a simple single-byte encoding such as latin-1. Some text may display incorrectly, but at least the same sequence of bytes will always represent the same feature.
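Some of the suggestions above can be tried with the standard library alone. The sketch below first attempts UTF-8 and then falls back on latin-1, and also shows what `errors='replace'` does to an undecodable byte. The helper `decode_with_fallback` is an assumption made for this example, not something scikit-learn provides.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Try each encoding in turn; return (text, encoding_used)."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, marking undecodable bytes with U+FFFD.
    return raw.decode(encodings[0], errors="replace"), encodings[0]


utf8_bytes = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
latin1_bytes = b"holdselig sind deine Ger\xfcche"

print(decode_with_fallback(utf8_bytes))
# ('Sei mir gegrüßt mein Sauerkraut', 'utf-8')
print(decode_with_fallback(latin1_bytes))
# ('holdselig sind deine Gerüche', 'latin-1')

# Forcing UTF-8 and replacing errors loses the original character:
print(latin1_bytes.decode("utf-8", errors="replace"))
# holdselig sind deine Ger�che
```

Because latin-1 maps every byte to some character, the fallback never fails, which is exactly why it can silently produce wrong characters for text that was really in another encoding.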

For example, the following snippet uses chardet (not shipped with scikit-learn, must be installed separately) to figure out the encoding of three texts. It then vectorizes the texts and prints the learned vocabulary. The output is not shown here.

>>> import chardet
>>> text1 = b"Sei mir gegr\xc3\xbc\xc3\x9ft mein Sauerkraut"
>>> text2 = b"holdselig sind deine Ger\xfcche"
>>> text3 = b"\xff\xfeA\x00u\x00f\x00 \x00F\x00l\x00\xfc\x00g\x00e\x00l\x00n\x00 \x00d\x00e\x00s\x00 \x00G\x00e\x00s\x00a\x00n\x00g\x00e\x00s\x00,\x00 \x00H\x00e\x00r\x00z\x00l\x00i\x00e\x00b\x00c\x00h\x00e\x00n\x00,\x00 \x00t\x00r\x00a\x00g\x00 \x00i\x00c\x00h\x00 \x00d\x00i\x00c\x00h\x00 \x00f\x00o\x00r\x00t\x00"
>>> decoded = [x.decode(chardet.detect(x)['encoding'])
...            for x in (text1, text2, text3)]
>>> v = CountVectorizer().fit(decoded).vocabulary_
>>> for term in v: print(term)

(Depending on the version of chardet, it might get the first one wrong.)

For an introduction to Unicode and character encodings in general, see Joel Spolsky’s Absolute Minimum Every Software Developer Must Know About Unicode.

4.2.3.6. Applications and examples

The bag of words representation is quite simplistic but surprisingly useful in practice.

In particular in a supervised setting it can be successfully combined with fast and scalable linear models to train document classifiers, for instance:

Classification of text documents using sparse features

In an unsupervised setting it can be used to group similar documents together by applying clustering algorithms such as K-means:

Clustering text documents using k-means

Finally it is possible to discover the main topics of a corpus by relaxing the hard assignment constraint of clustering, for instance by using Non-negative matrix factorization (NMF or NNMF):

Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation

4.2.3.7. Limitations of the Bag of Words representation

A collection of unigrams (what bag of words is) cannot capture phrases and multi-word expressions, effectively disregarding any word order dependence. Additionally, the bag of words model doesn’t account for potential misspellings or word derivations.

N-grams to the rescue! Instead of building a simple collection of unigrams (n=1), one might prefer a collection of bigrams (n=2), where occurrences of pairs of consecutive words are counted.

One might alternatively consider a collection of character n-grams, a representation resilient against misspellings and derivations.

For example, let’s say we’re dealing with a corpus of two documents: ['words', 'wprds']. The second document contains a misspelling of the word ‘words’. A simple bag of words representation would consider these two as very distinct documents, differing in both of the two possible features. A character 2-gram representation, however, would find the documents matching in 4 out of 8 features, which may help the preferred classifier decide better:

>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
>>> counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
>>> ngram_vectorizer.get_feature_names() == (
...     [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])
True
>>> counts.toarray().astype(int)
array([[1, 1, 1, 0, 1, 1, 1, 0],
       [1, 1, 0, 1, 1, 1, 0, 1]])

In the above example, the 'char_wb' analyzer is used, which creates n-grams only from characters inside word boundaries (padded with space on each side). The 'char' analyzer, alternatively, creates n-grams that span across words:

>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(5, 5))
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
...
<1x4 sparse matrix of type '<... 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
...     [' fox ', ' jump', 'jumpy', 'umpy '])
True

>>> ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(5, 5))
>>> ngram_vectorizer.fit_transform(['jumpy fox'])
...
<1x5 sparse matrix of type '<... 'numpy.int64'>'
    with 5 stored elements in Compressed Sparse ... format>
>>> ngram_vectorizer.get_feature_names() == (
...     ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox'])
True

The word boundaries-aware variant char_wb is especially interesting for languages that use white-spaces for word separation as it generates significantly less noisy features than the raw char variant in that case. For such languages it can increase both the predictive accuracy and convergence speed of classifiers trained using such features while retaining the robustness with regards to misspellings and word derivations.
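The difference between the two character analyzers can be reproduced with a short pure-Python sketch. The functions `char_ngrams` and `char_wb_ngrams` are simplified stand-ins written for this example (they skip preprocessing such as lowercasing) and should not be read as the library's exact code.

```python
def char_ngrams(text, n):
    """All character n-grams of `text`, spanning across word boundaries."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def char_wb_ngrams(text, n):
    """Character n-grams taken inside each word, padded with one space per side."""
    grams = []
    for word in text.split():
        padded = " " + word + " "
        # A padded word shorter than n still contributes itself once.
        last_start = max(len(padded) - n, 0)
        for i in range(last_start + 1):
            grams.append(padded[i:i + n])
    return grams


print(sorted(set(char_wb_ngrams("jumpy fox", 5))))
# [' fox ', ' jump', 'jumpy', 'umpy ']
print(sorted(set(char_ngrams("jumpy fox", 5))))
# ['jumpy', 'mpy f', 'py fo', 'umpy ', 'y fox']

# Robustness to a misspelling: 2-grams shared by 'words' and 'wprds'.
shared = set(char_wb_ngrams("words", 2)) & set(char_wb_ngrams("wprds", 2))
print(sorted(shared))  # [' w', 'ds', 'rd', 's ']
```

Four of the eight distinct 2-grams are shared by the correct and the misspelled word, which is the overlap the example above refers to.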

While some local positioning information can be preserved by extracting n-grams instead of individual words, bag of words and bag of n-grams destroy most of the inner structure of the document and hence most of the meaning carried by that internal structure.

In order to address the wider task of Natural Language Understanding, the local structure of sentences and paragraphs should thus be taken into account. Many such models will thus be cast as “Structured output” problems, which are currently outside of the scope of scikit-learn.

4.2.3.8. Vectorizing a large text corpus with the hashing trick

The above vectorization scheme is simple, but the fact that it holds an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute) causes several problems when dealing with large datasets:

the larger the corpus, the larger the vocabulary will grow and hence the memory use too,
fitting requires the allocation of intermediate data structures of size proportional to that of the original dataset.
building the word-mapping requires a full pass over the dataset hence it is not possible to fit text classifiers in a strictly online manner.
pickling and un-pickling vectorizers with a large vocabulary_ can be very slow (typically much slower than pickling / un-pickling flat data structures such as a NumPy array of the same size),
it is not easily possible to split the vectorization work into concurrent sub tasks as the vocabulary_ attribute would have to be a shared state with a fine grained synchronization barrier: the mapping from token string to feature index is dependent on ordering of the first occurrence of each token hence would have to be shared, potentially harming the concurrent workers’ performance to the point of making them slower than the sequential variant.

It is possible to overcome those limitations by combining the “hashing trick” (Feature hashing) implemented by the sklearn.feature_extraction.FeatureHasher class and the text preprocessing and tokenization features of the CountVectorizer.

This combination is implemented in HashingVectorizer, a transformer class that is mostly API compatible with CountVectorizer. HashingVectorizer is stateless, meaning that you don’t have to call fit on it:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> hv = HashingVectorizer(n_features=10)
>>> hv.transform(corpus)
...
<4x10 sparse matrix of type '<... 'numpy.float64'>'
    with 16 stored elements in Compressed Sparse ... format>

You can see that 16 non-zero feature tokens were extracted in the vector output: this is less than the 19 non-zeros extracted previously by the CountVectorizer on the same toy corpus. The discrepancy comes from hash function collisions because of the low value of the n_features parameter.

In a real world setting, the n_features parameter can be left to its default value of 2 ** 20 (roughly one million possible features). If memory or downstream models size is an issue selecting a lower value such as 2 ** 18 might help without introducing too many additional collisions on typical text classification tasks.

Note that the dimensionality does not affect the CPU training time of algorithms which operate on CSR matrices (LinearSVC(dual=True), Perceptron, SGDClassifier, PassiveAggressive) but it does for algorithms that work with CSC matrices (LinearSVC(dual=False), Lasso(), etc).

Let’s try again with the default setting:

>>> hv = HashingVectorizer()
>>> hv.transform(corpus)
...
<4x1048576 sparse matrix of type '<... 'numpy.float64'>'
    with 19 stored elements in Compressed Sparse ... format>

We no longer get the collisions, but this comes at the expense of a much larger dimensionality of the output space. Of course, other terms than the 19 used here might still collide with each other.

The HashingVectorizer also comes with the following limitations:

it is not possible to invert the model (no inverse_transform method), nor to access the original string representation of the features, because of the one-way nature of the hash function that performs the mapping.
it does not provide IDF weighting as that would introduce statefulness in the model. A TfidfTransformer can be appended to it in a pipeline if required.
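The core of the hashing trick fits in a few lines: the column of a token is computed from a hash of the token itself, so no vocabulary has to be stored or learned. The sketch below is an assumption-laden illustration: it uses MD5 from the standard library as a stand-in hash function and omits any normalization or sign handling that the real class may apply. The name `hashed_counts` is hypothetical.

```python
import hashlib
import re


def hashed_counts(doc, n_features=10):
    """Map one document to a fixed-length count vector without a vocabulary."""
    row = [0] * n_features
    for tok in re.findall(r"\b\w\w+\b", doc.lower()):
        # A stable hash is needed: Python's built-in hash() of strings
        # changes between interpreter runs.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        col = int.from_bytes(digest[:4], "little") % n_features
        row[col] += 1                  # distinct tokens may collide here
    return row


doc = "This is the second second document."
vec = hashed_counts(doc, n_features=10)
print(len(vec), sum(vec))              # 10 6
print(vec == hashed_counts(doc, 10))   # True: no fitting, same input gives same output
```

With only 10 columns, distinct tokens are likely to share a column, which is the collision effect described above; a larger `n_features` makes this rarer at the cost of a wider output.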

4.2.3.9. Performing out-of-core scaling with HashingVectorizer

An interesting development of using a HashingVectorizer is the ability to perform out-of-core scaling. This means that we can learn from data that does not fit into the computer’s main memory.

A strategy to implement out-of-core scaling is to stream data to the estimator in mini-batches. Each mini-batch is vectorized using HashingVectorizer so as to guarantee that the input space of the estimator has always the same dimensionality. The amount of memory used at any time is thus bounded by the size of a mini-batch. Although there is no limit to the amount of data that can be ingested using such an approach, from a practical point of view the learning time is often limited by the CPU time one wants to spend on the task.

For a full-fledged example of out-of-core scaling in a text classification task see Out-of-core classification of text documents.
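The streaming strategy described in this section can be outlined as a generator that hands out fixed-size mini-batches from any document iterator. This is a bare sketch: `iter_minibatches` and `stream_of_documents` are hypothetical names, and the vectorization and incremental-learning steps are only indicated by a comment.

```python
from itertools import islice


def iter_minibatches(doc_stream, batch_size):
    """Yield lists of at most `batch_size` documents from any iterable."""
    doc_stream = iter(doc_stream)
    while True:
        batch = list(islice(doc_stream, batch_size))
        if not batch:
            return
        yield batch


def stream_of_documents():
    # Stand-in for reading documents lazily from disk or the network.
    for i in range(7):
        yield "document number %d" % i


batch_sizes = []
for batch in iter_minibatches(stream_of_documents(), batch_size=3):
    # Each batch would be vectorized with a stateless vectorizer here and
    # passed to an estimator that supports incremental learning.
    batch_sizes.append(len(batch))

print(batch_sizes)  # [3, 3, 1]
```

Only one mini-batch is held in memory at a time, so peak memory depends on `batch_size` rather than on the size of the full dataset.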

4.2.3.10. Customizing the vectorizer classes

It is possible to customize the behavior by passing a callable to the vectorizer constructor:

>>> def my_tokenizer(s):
...     return s.split()  # split on whitespace only
...
>>> vectorizer = CountVectorizer(tokenizer=my_tokenizer)
>>> vectorizer.build_analyzer()(u"Some... punctuation!") == (
...     ['some...', 'punctuation!'])
True

In particular we name:

preprocessor: a callable that takes an entire document as input (as a single string), and returns a possibly transformed version of the document, still as an entire string. This can be used to remove HTML tags, lowercase the entire document, etc.

tokenizer: a callable that takes the output from the preprocessor and splits it into tokens, then returns a list of these.

analyzer: a callable that replaces the preprocessor and tokenizer. The default analyzers all call the preprocessor and tokenizer, but custom analyzers will skip this. N-gram extraction and stop word filtering take place at the analyzer level, so a custom analyzer may have to reproduce these steps.

(Lucene users might recognize these names, but be aware that scikit-learn concepts may not map one-to-one onto Lucene concepts.)
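How the three callables relate to each other can be pictured as function composition: an analyzer runs the preprocessor, then the tokenizer, then any token-level filtering. The sketch below is a simplified model of that chain written for this example; `make_analyzer` and the two helper functions are hypothetical and are not scikit-learn functions.

```python
import re


def strip_tags_and_lowercase(doc):
    """Plays the role of a preprocessor: whole string in, whole string out."""
    return re.sub(r"<[^>]+>", " ", doc).lower()


def whitespace_tokenizer(doc):
    """Plays the role of a tokenizer: string in, list of tokens out."""
    return doc.split()


def make_analyzer(preprocessor, tokenizer, stop_words=()):
    """Compose preprocessing, tokenization and stop word filtering."""
    stop = set(stop_words)

    def analyzer(doc):
        return [tok for tok in tokenizer(preprocessor(doc)) if tok not in stop]

    return analyzer


analyze_doc = make_analyzer(strip_tags_and_lowercase, whitespace_tokenizer,
                            stop_words={"the"})
print(analyze_doc("<b>Some...</b> punctuation in the text!"))
# ['some...', 'punctuation', 'in', 'text!']
```

A custom analyzer that bypasses this chain has to carry out any filtering or n-gram construction itself, which is the caveat stated in the analyzer item above.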

To make the preprocessor, tokenizer and analyzers aware of the model parameters it is possible to derive from the class and override the build_preprocessor, build_tokenizer and build_analyzer factory methods instead of passing custom functions.

Some tips and tricks:

If documents are pre-tokenized by an external package, then store them in files (or strings) with the tokens separated by whitespace and pass analyzer=str.split.

Fancy token-level analysis such as stemming, lemmatizing, compound splitting, filtering based on part-of-speech, etc. is not included in the scikit-learn codebase, but can be added by customizing either the tokenizer or the analyzer. Here’s a CountVectorizer with a tokenizer and lemmatizer using NLTK:

<pre style="padding: 5px 10px; font-family: Monaco, Menlo, Consolas, &quot;Courier New&quot;, monospace; font-size: 13px; color: rgb(34, 34, 34); border-radius: 4px; display: block; margin: 0.1em 0px 0.5em; line-height: 1.2em; word-break: break-all; word-wrap: break-word; white-space: pre-wrap; background-color: rgb(248, 248, 248); border: 1px solid rgb(221, 221, 221); overflow: auto hidden;">>>> from nltk import word_tokenize          
>>> from nltk.stem import WordNetLemmatizer 
>>> class LemmaTokenizer(object):
...     def __init__(self):
...         self.wnl = WordNetLemmatizer()
...     def __call__(self, doc):
...         return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]
...
>>> vect = CountVectorizer(tokenizer=LemmaTokenizer())  
</pre>







(Note that this will not filter out punctuation.)

The following example will, for instance, transform some British spelling to American spelling:

<pre style="padding: 5px 10px; font-family: Monaco, Menlo, Consolas, &quot;Courier New&quot;, monospace; font-size: 13px; color: rgb(34, 34, 34); border-radius: 4px; display: block; margin: 0.1em 0px 0.5em; line-height: 1.2em; word-break: break-all; word-wrap: break-word; white-space: pre-wrap; background-color: rgb(248, 248, 248); border: 1px solid rgb(221, 221, 221); overflow: auto hidden;">>>> import re
>>> def to_british(tokens):
...     for t in tokens:
...         t = re.sub(r"(...)our$", r"\1or", t)
...         t = re.sub(r"([bt])re$", r"\1er", t)
...         t = re.sub(r"([iy])s(e$|ing|ation)", r"\1z\2", t)
...         t = re.sub(r"ogue$", "og", t)
...         yield t
...
>>> class CustomVectorizer(CountVectorizer):
...     def build_tokenizer(self):
...         tokenize = super(CustomVectorizer, self).build_tokenizer()
...         return lambda doc: list(to_british(tokenize(doc)))
...
>>> print(CustomVectorizer().build_analyzer()(u"color colour")) 
[...'color', ...'color']
</pre>







The same approach can be used for other styles of preprocessing; examples include stemming, lemmatization, or normalizing numerical tokens, with the latter illustrated in:

> *   [Biclustering documents with the Spectral Co-clustering algorithm](http://scikit-learn.org/stable/auto_examples/bicluster/plot_bicluster_newsgroups.html#sphx-glr-auto-examples-bicluster-plot-bicluster-newsgroups-py)

Customizing the vectorizer can also be useful when handling Asian languages that do not use an explicit word separator such as whitespace.
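
As a rough illustration, the sketch below plugs in a custom tokenizer that treats every non-whitespace character as a token. This is only a stand-in chosen here for simplicity: `char_tokenizer` is a hypothetical helper, and in practice a dedicated word segmenter for the language in question would usually be called inside the tokenizer instead.

```python
from sklearn.feature_extraction.text import CountVectorizer


def char_tokenizer(doc):
    """Hypothetical tokenizer: one token per non-whitespace character."""
    return [ch for ch in doc if not ch.isspace()]


# Sample sentences, made up for illustration.
docs = ["我爱 机器学习", "机器学习很有趣"]

vectorizer = CountVectorizer(tokenizer=char_tokenizer)
X = vectorizer.fit_transform(docs)

print(vectorizer.build_analyzer()("我爱 机器学习"))
# ['我', '爱', '机', '器', '学', '习']
print(X.shape)  # (2, number of distinct characters)
```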

Last edited: 2018.09.02 10:40:02
