Deep Learning with Python, Chapter 6, Section 6.1: Deep learning for text

6.1 Deep learning for text

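Text is one of the most widespread forms of sequence data. It can be understood as either a sequence of characters or a sequence of words, but it is most common to work at the level of words. The deep-learning sequence-processing models introduced in the following sections can be applied to document classification, sentiment analysis, author identification, and question answering (QA) in a constrained context. Keep in mind that none of these models truly understands text in a human sense; they only map the statistical structure of written language. Deep learning for natural-language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels.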

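Like all other neural networks, deep-learning models don't take raw text as input: they only work with numeric tensors. Vectorizing text is the process of transforming text into numeric tensors. This can be done in several ways:

Segment text into words, and transform each word into a vector;
Segment text into characters, and transform each character into a vector;
Extract n-grams of words or characters, and transform each n-gram into a vector. N-grams are overlapping groups of multiple consecutive words or characters.

Collectively, the different units into which you can break down text (words, characters, or n-grams) are called tokens, and breaking text into such tokens is called tokenization. All text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens. These vectors, packed into sequence tensors, are fed into the neural network. There are multiple ways to associate a vector with a token; this section presents two of them: one-hot encoding and word embedding.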
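As a rough illustration of word-level and character-level tokenization, here is a minimal sketch in plain Python. The helper names `word_tokens` and `char_tokens`, and the simple lowercase-and-strip-punctuation rule, are assumptions made for this example; they are not part of Keras or of the book's listings.

```python
import string

def word_tokens(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()

def char_tokens(text):
    """Treat every character (including spaces) as one token."""
    return list(text)

sentence = 'The cat sat on the mat.'
words = word_tokens(sentence)
chars = char_tokens(sentence)
print(words)      # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(chars[:7])  # ['T', 'h', 'e', ' ', 'c', 'a', 't']
```

Real tokenizers make many more decisions (contractions, Unicode punctuation, casing), so treat this only as a picture of what a "token" is at each granularity.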

Figure 6.1 The text-vectorization process

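Understanding n-grams and bag-of-words

Word n-grams are groups of N (or fewer) consecutive words that you can extract from a sentence. The same concept also applies to characters instead of words.

Here is a simple example. The sentence "The cat sat on the mat" can be decomposed into the following set of 2-grams: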

{"The", "The cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the", "the mat", "mat"}

It can also be decomposed into the following set of 3-grams:

{"The", "The cat", "cat", "cat sat", "The cat sat", "sat", "sat on", "on", "cat sat on", "on the", "the", "sat on the", "the mat", "mat", "on the mat"}

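Such sets are called a bag-of-2-grams and a bag-of-3-grams, respectively. The term bag refers to the fact that you are dealing with a set of tokens rather than a list or sequence: the tokens have no specific order. This family of tokenization methods is called bag-of-words.

Because bag-of-words does not preserve order, the general structure of the sequence is lost. It therefore tends to be used in shallow language-processing models rather than in deep-learning models. Extracting n-grams is a form of feature engineering, and deep learning replaces this kind of rigid, hand-crafted feature engineering with hierarchical feature learning. One-dimensional convnets and recurrent neural networks, introduced later in this chapter, are capable of learning representations for groups of words and characters without being explicitly told about the existence of such groups. For this reason, n-grams are not covered any further in this book. But do keep in mind that they are a powerful, unavoidable feature-engineering tool when using lightweight, shallow text-processing models such as logistic regression and random forests.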
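The bags shown above can be reproduced with a few lines of plain Python. This is a minimal sketch; the function name `ngrams_up_to` is invented for the example, and it follows the convention used in the text that a bag of N-grams contains every group of N or fewer consecutive words.

```python
def ngrams_up_to(tokens, n):
    """Return the set of all k-grams (1 <= k <= n) of consecutive tokens."""
    bag = set()
    for k in range(1, n + 1):
        for start in range(len(tokens) - k + 1):
            bag.add(' '.join(tokens[start:start + k]))
    return bag

tokens = 'The cat sat on the mat'.split()
bag_2 = ngrams_up_to(tokens, 2)
bag_3 = ngrams_up_to(tokens, 3)
print(sorted(bag_2))
print(len(bag_2), len(bag_3))  # 11 15
```

Because the result is a Python set, it carries no word order, which is exactly the information loss described above.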

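6.1.1 One-hot encoding of words and characters

One-hot encoding is the most common, most basic way to turn a token into a vector. You saw it in action in the IMDB and Reuters examples in chapter 3. It consists of associating a unique integer index with every word and then turning this index i into a binary vector of size N (the size of the vocabulary): the vector is all zeros except for the i-th entry, which is 1.

Of course, one-hot encoding can be done at the character level as well. To make the distinction clear, listings 6.1 and 6.2 show one-hot encoding for words and for characters, respectively.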

#Listing 6.1 Word-level one-hot encoding

import numpy as np

# Initial data: one entry per sample (in this example, a sample is a
# sentence, but it could be an entire document).
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Builds an index of all tokens in the data.
token_index = {}
for sample in samples:
    # Tokenizes the samples via the split method. In real life, you'd also
    # strip punctuation and special characters from the samples.
    for word in sample.split():
        if word not in token_index:
            # Assigns a unique index to each unique word. Note that you
            # don't attribute index 0 to anything.
            token_index[word] = len(token_index) + 1

# Vectorizes the samples. You'll only consider the first max_length
# words in each sample.
max_length = 10

# This is where you store the results.
results = np.zeros(shape=(len(samples),
                          max_length,
                          max(token_index.values()) + 1))

for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.

#Listing 6.2 Character-level one-hot encoding

import string

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# All printable ASCII characters.
characters = string.printable
# Maps each character to a unique index (index 0 is left unused), so that
# token_index.get(character) returns an integer.
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    # Only the first max_length characters of each sample are considered.
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.

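Keras has built-in utilities for one-hot encoding text. You should use them, because they take care of a number of important features, such as stripping specified characters from strings and only taking into account the N most common words in your dataset (a common restriction, to avoid dealing with very large input vector spaces).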

#Listing 6.3 Using Keras for word-level one-hot encoding

from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Creates a tokenizer, configured to only take into account the 1,000
# most common words.
tokenizer = Tokenizer(num_words=1000)
# Builds the word index.
tokenizer.fit_on_texts(samples)

# Turns strings into lists of integer indices.
sequences = tokenizer.texts_to_sequences(samples)

# You could also directly get the one-hot binary representations.
# Vectorization modes other than one-hot encoding are supported by this
# tokenizer.
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# How you can recover the word index that was computed.
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

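A variant of one-hot encoding is the so-called one-hot hashing trick, which you can use when the vocabulary is too large to handle explicitly. Instead of assigning an index to each word and keeping these indices in a dictionary, you hash words into vectors of fixed size, typically with a very lightweight hashing function. The main advantage of this method is that it does away with maintaining an explicit word index, which saves memory and allows online encoding of the data (you can generate token vectors right away, before you have seen all of the available data). Its one drawback is that it is susceptible to hash collisions: two different words may end up with the same hash, and any machine-learning model looking at these hashes then cannot tell the two words apart. The likelihood of hash collisions decreases as the dimensionality of the hashing space grows relative to the number of unique tokens being hashed.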

#Listing 6.4 Word-level one-hot encoding with hashing trick

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Stores the words as vectors of size 1,000. If you have close to 1,000
# words (or more), you'll see many hash collisions, which will decrease
# the accuracy of this encoding method.
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hashes the word into an integer index between 0 and 999.
        # Python 3 salts str hashes per process, so the indices change
        # from run to run unless PYTHONHASHSEED is fixed.
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
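Python's built-in `hash()` is salted per process for strings, so the bucket a word lands in changes between runs. If reproducible indices matter, one option is a digest from `hashlib`. The sketch below is an illustration under that assumption, not the book's method: the helpers `stable_index` and `count_collisions` and the synthetic vocabulary are invented for the example. It also shows how the number of collisions depends on the size of the hashing space.

```python
import hashlib

def stable_index(word, dimensionality):
    """Map a word to a bucket in [0, dimensionality) using SHA-256.

    Unlike the built-in hash(), the result is the same in every run.
    """
    digest = hashlib.sha256(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % dimensionality

def count_collisions(words, dimensionality):
    """Number of distinct words that land in an already occupied bucket."""
    unique_words = set(words)
    buckets = {stable_index(word, dimensionality) for word in unique_words}
    return len(unique_words) - len(buckets)

# Synthetic 500-word vocabulary, for illustration only.
vocabulary = ['word%d' % i for i in range(500)]
for dim in (100, 1000, 100000):
    print(dim, count_collisions(vocabulary, dim))
```

With 500 words and only 100 buckets, at least 400 words must collide; with a hashing space much larger than the vocabulary, collisions become rare.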

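6.1.2 Word embeddings

Another popular and powerful way to associate a vector with a word is to use dense word vectors, also called word embeddings. Whereas the vectors obtained through one-hot encoding are binary, sparse (mostly zeros), and very high-dimensional (same dimensionality as the number of words in the vocabulary), word embeddings are low-dimensional floating-point vectors (that is, dense vectors); see figure 6.2. One-hot vectors are fixed by the encoding scheme, whereas word embeddings are learned from data. It is common to see word embeddings that are 256-dimensional, 512-dimensional, or 1,024-dimensional. One-hot encoding, on the other hand, generally leads to vectors that are 20,000-dimensional or greater (taking a 20,000-word vocabulary as the example). So word embeddings pack more information into far fewer dimensions.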
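To put rough numbers on that comparison, the arithmetic below counts how many values are needed to represent one sample under each scheme. The vocabulary size and embedding size come from the paragraph above; the sequence length of 100 tokens is an arbitrary assumption for the example.

```python
vocab_size = 20000      # vocabulary size used as the example in the text
embedding_dim = 256     # one of the common embedding sizes mentioned above
sequence_length = 100   # hypothetical number of tokens in one sample

one_hot_values = sequence_length * vocab_size    # mostly zeros
dense_values = sequence_length * embedding_dim   # all informative floats

print(one_hot_values)                 # 2000000
print(dense_values)                   # 25600
print(one_hot_values / dense_values)  # 78.125
```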

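Figure 6.2 Vectors obtained from one-hot encoding compared with word embeddings

There are two ways to obtain word embeddings:

Learn word embeddings jointly with the main task you care about (such as document classification or sentiment prediction). In this setup, you start with random word vectors and then learn them in the same way you learn the weights of a neural network.
Load pretrained word embeddings. These were precomputed using a different machine-learning task than the one you are trying to solve.

Let's look at both.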

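Learning word embeddings with the Embedding layer

The simplest way to associate a dense vector with a word is to choose the vector at random. The problem with this approach is that the resulting embedding space has no structure: for instance, the words accurate and exact may end up with completely different embeddings, even though they are interchangeable in most sentences. It is difficult for a deep neural network to make sense of such a noisy, unstructured embedding space.

To get a bit more abstract, the geometric relationships between word vectors should reflect the semantic relationships between the words. Word embeddings are meant to map human language into a geometric space. For instance, you would expect synonyms to be embedded into similar word vectors; more generally, you would expect the geometric distance (such as L2 distance) between any two word vectors to relate to the semantic distance between the associated words. In addition to distance, directions in the embedding space should also be meaningful. A concrete example makes both points clearer.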

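Figure 6.3 A toy example of a word-embedding space

In figure 6.3, four words are embedded on a 2D plane: cat, dog, wolf, and tiger. With the vector representations chosen here, some semantic relationships between these words can be encoded as geometric transformations. For instance, the same vector goes from cat to tiger and from dog to wolf; it could be interpreted as the "from pet to wild animal" vector. Similarly, another vector goes from dog to cat and from wolf to tiger, and could be interpreted as the "from canine to feline" vector.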
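The relationships described for figure 6.3 can be checked numerically on a toy example. The 2D coordinates below are chosen by hand purely for illustration; they are not taken from the figure or from any trained embedding.

```python
# Hypothetical 2D coordinates, hand-picked so the relationships hold.
embedding = {
    'dog':   (1.0, 1.0),
    'cat':   (2.0, 1.0),
    'wolf':  (1.0, 3.0),
    'tiger': (2.0, 3.0),
}

def offset(src, dst):
    """Vector that moves from the embedding of src to the embedding of dst."""
    (x0, y0), (x1, y1) = embedding[src], embedding[dst]
    return (x1 - x0, y1 - y0)

pet_to_wild = offset('cat', 'tiger')
canine_to_feline = offset('dog', 'cat')

print(pet_to_wild == offset('dog', 'wolf'))         # True
print(canine_to_feline == offset('wolf', 'tiger'))  # True
```

In a learned embedding space these equalities hold only approximately, so in practice one looks for the nearest word vector rather than an exact match.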

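In real-world word-embedding spaces, common examples of meaningful geometric transformations are "gender" vectors and "plural" vectors. For instance, by adding a "female" vector to the vector "king", you obtain the vector "queen". By adding a "plural" vector to the vector "king", you obtain "kings".

Is there some ideal word-embedding space that would perfectly map human language and could be used for any natural-language-processing task? Possibly, but nothing of the sort has been computed yet. Also, there is no such thing as a single human language: there are many different languages, and they are not isomorphic, because each one reflects a specific culture and a specific context. More pragmatically, what makes a good word-embedding space depends heavily on the task: the perfect word-embedding space for an English-language movie-review sentiment-analysis model may look different from the perfect embedding space for an English-language document-classification model, because the importance of certain semantic relationships varies from task to task.

It is thus reasonable to learn a new embedding space with every new task. Fortunately, backpropagation makes this easy, and Keras makes it even easier: it comes down to learning the weights of a layer, the Embedding layer.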

#Listing 6.5 Instantiating an Embedding layer

from keras.layers import Embedding

# The Embedding layer takes at least two arguments: the number of possible
# tokens (here, 1,000: 1 + maximum word index) and the dimensionality of
# the embeddings (here, 64).
embedding_layer = Embedding(1000, 64)

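The Embedding layer maps integer indices, which stand for specific words, to dense vectors. It takes integers as input, looks them up in an internal dictionary, and returns the associated vectors. It is effectively an efficient dictionary lookup (see figure 6.4).

Figure 6.4 The Embedding layer

The Embedding layer takes as input a 2D tensor of integers, of shape (samples, sequence_length), where each entry is a sequence of integers. It can embed sequences of variable lengths: for instance, you could feed it batches of shape (32, 10) (a batch of 32 sequences of length 10) or (64, 15) (a batch of 64 sequences of length 15). All sequences in a batch must have the same length, though, because they are packed into a single tensor. Sequences that are shorter than the others are therefore padded with zeros, and sequences that are too long are truncated.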
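Padding and truncation can be sketched in a few lines of plain Python. The helper `pad_or_truncate` below is written for this example only. It pads on the left and, for long sequences, keeps the last `maxlen` items, which mirrors the defaults of Keras's `pad_sequences` (`padding='pre'`, `truncating='pre'`); pass `padding='post'` or `truncating='post'` to the Keras function if you want the other behavior.

```python
def pad_or_truncate(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen`
    items of sequences that are too long."""
    batch = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]
        batch.append([value] * (maxlen - len(seq)) + seq)
    return batch

batch = pad_or_truncate([[5, 8], [3, 1, 4, 1, 5, 9]], maxlen=4)
print(batch)  # [[0, 0, 5, 8], [4, 1, 5, 9]]
```

Every row of the result has the same length, so the batch can be packed into a single 2D integer tensor.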

The Embedding layer returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality). Such a tensor can then be processed by an RNN layer or a 1D convolution layer.
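The shape transformation can be imitated with plain NumPy indexing. This is a sketch of the lookup idea only, not how Keras implements the layer: `embedding_table` is a random stand-in for the layer's weight matrix, and the sizes are the ones used as examples in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 1000, 64

# Stand-in for the layer's weights: one random row per token index.
embedding_table = rng.normal(size=(vocab_size, embedding_dim)).astype('float32')

# A batch of 32 sequences, each holding 10 integer token indices.
token_ids = rng.integers(0, vocab_size, size=(32, 10))

# Integer-array indexing fetches one row of the table per index.
embedded = embedding_table[token_ids]
print(embedded.shape)  # (32, 10, 64)
```

Each integer in the (32, 10) input is replaced by its 64-dimensional row, which is where the extra axis in the output comes from.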

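When you instantiate an Embedding layer, its weights (its internal dictionary of token vectors) are initially random, just as with any other layer. During training, these word vectors are gradually adjusted via backpropagation, structuring the space into something the downstream model can exploit. Once fully trained, the embedding space shows a lot of structure, of a kind specialized for the problem the model was trained on.

Let's apply this idea to the familiar IMDB movie-review sentiment-prediction task. First, prepare the data: restrict the vocabulary to the 10,000 most common words and look at only 20 words of each review. The network will learn 8-dimensional word embeddings and turn the input integer sequences (2D integer tensor) into embedded sequences (3D float tensor).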

#Listing 6.6 Loading the IMDB data for use with an Embedding layer

from keras.datasets import imdb
from keras import preprocessing

# Number of words to consider as features.
max_features = 10000
# Keeps at most this many words per review (among the max_features most
# common words).
maxlen = 20

# Loads the data as lists of integers.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Turns the lists of integers into a 2D integer tensor of shape
# (samples, maxlen).
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)

#Listing 6.7 Using an Embedding layer and classifier on the IMDB data

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
# Specifies the maximum input length to the Embedding layer so you can
# later flatten the embedded inputs. After the Embedding layer, the
# activations have shape (samples, maxlen, 8).
model.add(Embedding(10000, 8, input_length=maxlen))
# Flattens the 3D tensor of embeddings into a 2D tensor of shape
# (samples, maxlen * 8).
model.add(Flatten())

# Adds the classifier on top.
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

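This code reaches a validation accuracy of about 76%, which is pretty good considering that it only looks at 20 words of each review. Note, however, that merely flattening the embedded sequences and training a single Dense layer on top leads to a model that treats each word in the input sequence separately, without considering inter-word relationships and sentence structure (for example, this model would likely treat both "this movie is a bomb" and "this movie is the bomb" as negative reviews). Adding RNN layers or 1D convolution layers on top of the embedded sequences lets the model learn features that take each sequence into account as a whole. The following sections cover this in detail.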

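Using pretrained word embeddings

Sometimes you have so little training data available that you cannot use it alone to learn an appropriate embedding of your vocabulary. What do you do then?

Instead of learning word embeddings jointly with the problem you want to solve, you can load embedding vectors from a precomputed embedding space that is highly structured and exhibits useful properties, one that captures generic aspects of language structure. The rationale behind using pretrained word embeddings in natural-language processing is much the same as for using pretrained convnets in image classification: you don't have enough suitable data to learn the features for the task at hand, but you expect the features you need to be fairly generic, that is, common visual features or common semantic features.

Such word embeddings are generally computed from word co-occurrence statistics, using a variety of techniques, some involving neural networks, others not. The idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was initially explored by Bengio et al. in the early 2000s, but it only took off in research and industry applications after Tomas Mikolov at Google developed the famous Word2vec algorithm in 2013. Word2vec dimensions capture specific semantic properties.

There are various precomputed databases of word embeddings that you can download and use in a Keras Embedding layer. Word2vec is one of them. Another popular one is GloVe (Global Vectors for Word Representation), developed by Stanford researchers in 2014. It is an embedding technique based on factorizing a matrix of word co-occurrence statistics. Its developers have made precomputed embeddings available for millions of English tokens.

Let's look at how to get started using GloVe embeddings in a Keras model. The same method is valid for Word2vec embeddings or any other word-embedding database.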

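6.1.3 From raw text to word embeddings

You'll use a model similar to the one above, this time with pretrained word embeddings. And instead of using the pretokenized IMDB data packaged in Keras, you'll download the original text data.

Downloading the IMDB data as raw text

First, head to http://mng.bz/0tIo, download the raw IMDB dataset, and uncompress it.

Next, collect the individual training reviews into a list of strings, and collect the review labels into a list of labels.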

#Listing 6.8 Processing the labels of the raw IMDB data

import os

imdb_dir = '/Users/fchollet/Downloads/aclImdb'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

# Reads every .txt review; 'neg' reviews get label 0, 'pos' reviews label 1.
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname[-4:] == '.txt':
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

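Tokenizing the data

Now vectorize the text and prepare a training and validation split. Because pretrained word embeddings are meant to be particularly useful when little training data is available, add the following twist: restrict the training data to the first 200 samples. In effect, you'll learn to classify movie reviews after looking at just 200 examples.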

#Listing 6.9 Tokenizing the text of the raw IMDB data

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

# Keeps at most 100 words per review.
maxlen = 100
# Trains on 200 samples.
training_samples = 200
# Validates on 10,000 samples.
validation_samples = 10000
# Considers only the top 10,000 words in the dataset.
max_words = 10000

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)

labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# Splits the data into a training set and a validation set, but first
# shuffles the data, because you're starting with data in which samples
# are ordered (all negative first, then all positive).
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]
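The shuffle in listing 6.9 works because the same index permutation is applied to both arrays, so every sample keeps its own label. A toy check of that property is sketched below; the four samples and the fixed seed are invented for the example.

```python
import numpy as np

# Four toy samples, ordered like the raw data: negatives first.
data = np.array([[10], [11], [12], [13]])
labels = np.array([0, 0, 1, 1])

rng = np.random.default_rng(42)
indices = rng.permutation(len(data))

# Indexing both arrays with the same permutation keeps pairs aligned.
shuffled_data = data[indices]
shuffled_labels = labels[indices]

# In this toy data, samples with value >= 12 are the positive ones.
recovered_labels = (shuffled_data[:, 0] >= 12).astype(int)
print(np.array_equal(recovered_labels, shuffled_labels))  # True
```

Shuffling `data` and `labels` with two separate calls to a shuffle function would break this pairing.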

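Downloading the GloVe word embeddings

Go to https://nlp.stanford.edu/projects/glove and download the precomputed embeddings trained on 2014 English Wikipedia. It is an 822 MB zip file called glove.6B.zip, containing 100-dimensional embedding vectors for 400,000 words.

Preprocessing the embeddings

Parse the unzipped file (a .txt file) to build an index that maps words to their vector representation.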

#Listing 6.10 Parsing the GloVe word-embeddings file

import os

import numpy as np

glove_dir = '/Users/fchollet/Downloads/glove.6B'

# Each line of the file holds a word followed by its coefficients.
embeddings_index = {}
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

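Next, build an embedding matrix that you can load into an Embedding layer. It must be a matrix of shape (max_words, embedding_dim), where each entry i contains the embedding_dim-dimensional vector for the word of index i in the reference word index (built during tokenization). Note that index 0 is not supposed to stand for any word; it is a placeholder.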

#Listing 6.11 Preparing the GloVe word-embeddings matrix

embedding_dim = 100

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # Words not found in the embedding index will be all zeros.
            embedding_matrix[i] = embedding_vector
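To see the parsing and matrix-filling logic work end to end without downloading anything, the sketch below runs the same steps on three made-up lines in the GloVe text format (a word followed by its coefficients). The vectors, the tiny `word_index`, and the 3-dimensional embedding size are all invented for illustration.

```python
import io

import numpy as np

# Three invented lines in the GloVe text format: word, then coefficients.
fake_glove_file = io.StringIO(
    'the 0.1 0.2 0.3\n'
    'cat 0.4 0.5 0.6\n'
    'mat 0.7 0.8 0.9\n'
)

embeddings_index = {}
for line in fake_glove_file:
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Hypothetical tokenizer output; index 0 is reserved as a placeholder.
word_index = {'the': 1, 'cat': 2, 'sat': 3, 'mat': 4}
max_words, embedding_dim = 5, 3

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector

print(embedding_matrix)
```

Row 0 (the placeholder) and row 3 ("sat", which has no pretrained vector in this toy file) stay all zeros, while the other rows receive their pretrained vectors.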

Defining a model

You'll use the same model architecture as before.

#Listing 6.12 Model definition

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

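Loading the GloVe embeddings in the model

The Embedding layer has a single weight matrix: a 2D float matrix where each entry i is the word vector associated with index i. Load the GloVe matrix you prepared into the Embedding layer, the first layer in the model.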

#Listing 6.13 Loading pretrained word embeddings into the Embedding layer

model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

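Additionally, freeze the Embedding layer by setting its trainable attribute to False. When parts of a model are pretrained (like the Embedding layer) and parts are randomly initialized (like the classifier), the pretrained parts should not be updated during training, so that they don't forget what they already know. The large gradient updates triggered by the randomly initialized layers would otherwise disrupt the already-learned features.

Training and evaluating the model

Compile and train the model.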

#Listing 6.14 Training and evaluation

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')

Now plot the model's performance over time (see figures 6.5 and 6.6).

#Listing 6.15 Plotting the results

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

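Figure 6.5 Training and validation loss when using pretrained word embeddings

Figure 6.6 Training and validation accuracy when using pretrained word embeddings

The model starts overfitting almost immediately, which is common with so few training samples. Validation accuracy has high variance for the same reason, but it does get into the 50% range.

Your results may differ: because there are so few training samples, performance depends heavily on exactly which 200 samples are chosen, and they are chosen at random.

You can also train the same model without loading the pretrained word embeddings and without freezing the Embedding layer, using the same 200 training samples as before (see figures 6.7 and 6.8).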

#Listing 6.16 Training the same model without pretrained word embeddings

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))

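Figure 6.7 Training and validation loss without using pretrained word embeddings

Figure 6.8 Training and validation accuracy without using pretrained word embeddings

This time, validation accuracy stalls at around 50%. So with this few samples, pretrained word embeddings outperform jointly learned embeddings.

Finally, evaluate the model on the test data. First, tokenize the test data.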

#Listing 6.17 Tokenizing the data of the test set

test_dir = os.path.join(imdb_dir, 'test')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname[-4:] == '.txt':
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

# Reuses the tokenizer fitted on the training texts.
sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)

Next, load and evaluate the first model.

#Listing 6.18 Evaluating the model on the test set

model.load_weights('pre_trained_glove_model.h5')
model.evaluate(x_test, y_test)

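This yields a test accuracy of 56%.

6.1.4 Wrapping up

You have learned how to:

Tokenize raw text
Use the Keras Embedding layer to learn task-specific word embeddings
Use pretrained word embeddings to get a boost on natural-language-processing problems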

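To be continued...

Enjoy!

I set out to translate this book series because it explains deep learning in an accessible way. It offers worked examples as well as the author's deep understanding of deep-learning concepts and principles, built up over years of practice. One last, less important point: François Chollet is the author of Keras.
Disclaimer: this material is for personal study, exchange, and research only; any other use is prohibited. If you like it, please buy the original English edition.

侠天 focuses on big data, machine learning, and mathematics, and shares related technical articles through a personal public account.

If you find anything inappropriate in this article, please contact me.

Last edited: 2021.12.16 15:24:48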
