02-seq2seq原理与实践

目录

原理部分

  • 机器翻译发展历史
  • Seq2Seq网络基本架构
  • Seq2Seq网络应用
  • Seq2Seq存在的问题
  • Attention机制

实践部分

  • 任务1:
    • 数据预处理
    • 编码层与词向量
    • 完成解码模块
    • 模型迭代
  • 任务2:
    • 数据预处理
    • 使用构建好的词向量
    • 完成解码操作
    • 任务总结


在学习Seq2Seq之前,先来回顾一下RNN(图1)和LSTM(图2)的网络架构。

图1 RNN网络架构
图2 LSTM网络架构

原理部分

机器翻译发展历史

图3 最早期的逐字翻译

逐字翻译出来的结果明显不符合人类日常语言交流的习惯,语言生硬甚至不符合语义,于是发展出了基于统计学的机器翻译;但它也有明显的缺点,就是不能充分利用上下文信息。


图4 基于统计学的机器翻译

再到现在基于循环神经网络(RNN)和词向量(word embedding)的机器翻译,如图5。


图5 基于深度学习的机器翻译

有了输入的内容后,先对其进行编码,以便计算机进行计算和处理;处理完成后还需要对结果进行解码,如图6。


图6 基于深度学习的机器翻译

假设用户输入一段英文文本序列,想要得到对应的西班牙语翻译:

  • 首先进行Input,接收到用户输入的文本序列。
  • 其次,文本序列进入编码器Encoder(如RNN)进行编码,得到一个固定维度的向量(例如3维)
  • 然后,将这个向量输入到解码器Decoder中
  • 最后,得到解码后的文本

纵观全局,整个流程就是:从用户那里得到一段文本序列(Sequence),经过计算机的处理(To),即输入和编码;最终得到对应的文本序列(Sequence),即解码和输出。这也正是Seq2Seq(Sequence to Sequence)名字的由来。

Seq2Seq的网络架构

整个网络模型分为Encoder和Decoder两部分,两者通过一个中间状态向量连接。

  • Encoder是一个RNN网络,其隐藏层包含有若干个单元。每个单元都是一个LSTM单元。Encoder输出的结果是经过处理的向量,并作为Decoder的输入。
  • 同理,Decoder结构与Encoder结构类似,每一个单元的输入是前一个单元的输出,即每步得出一个结果。
  • 该模型训练有一个缺点,就是语料数据很难获取。

以下图7为例,现在收到了一封邮件,内容为Are you free tomorrow,希望最终得到Yes, What's up?的回复。Tips:START为开始符(有的论文用GO表示);
END为终止符,作为解码器停止解码的标志,有的论文称为EOS(End of Sentence)。这些特殊符号需要在数据预处理时加入训练数据中。

图7 Seq2Seq网络架构实例
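下面用一小段纯 Python 演示给目标序列加上起始符和终止符的做法(仅为示意,符号名与后文实践部分使用的 <GO>/<EOS> 一致):

GO, EOS = '<GO>', '<EOS>'

def add_special_tokens(target_tokens):
    """解码器输入以 <GO> 开头,训练标签以 <EOS> 结尾。"""
    decoder_input = [GO] + target_tokens
    decoder_target = target_tokens + [EOS]
    return decoder_input, decoder_target

print(add_special_tokens(['Yes', ',', "What's", 'up', '?']))
# (['<GO>', 'Yes', ',', "What's", 'up', '?'], ['Yes', ',', "What's", 'up', '?', '<EOS>'])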

Seq2Seq的应用

  • 机器翻译


    图8 Seq2Seq网络应用-机器翻译
  • 文本摘要


    图9 Seq2Seq网络应用-文本摘要
  • 情感对话生成


    图10 Seq2Seq网络应用-情感对话生成
  • 代码补全


    图11 Seq2Seq网络应用-代码补全

Seq2Seq存在的问题

  • 压缩损失了信息
    如图12,在进行模型训练前,需要先对文本进行embedding,即把文本映射为向量,然后再送入LSTM单元。但即使LSTM对信息的保留控制得再好,把整个序列压缩到最后一个节点时也总会丢失一部分信息,这会影响最终的预测结果。


    图12 LSTM中的信息丢失问题
  • 长度限制
    如果输入的序列过长,训练出来的模型表达效果也不会太出色,一般理想长度为10~20个词,如图13。


    图13 Seq2Seq受到文本长度的影响

Attention机制

基于以上的问题,在模型中加入Attention注意力机制,具体原理可以看02-注意力机制-attention机制(基于循环神经网络RNN)这篇文章。

Attention机制在计算机视觉领域中的解释是:以“高分辨率”聚焦在图片的某个特定区域,同时以“低分辨率”感知图像的周边区域。大量实验证明,将attention机制应用在机器翻译、摘要生成、阅读理解等任务上,效果显著。
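为了直观理解注意力的计算过程,下面给出一个用 NumPy 实现的点积注意力小例子(仅为示意;后文任务2实际使用的是 BahdanauAttention,属于加性注意力,但思路一致:先计算对齐分数,softmax 得到权重,再对编码器各时刻的输出加权求和得到上下文向量):

import numpy as np

def dot_product_attention(decoder_state, encoder_outputs):
    """decoder_state: [hidden], encoder_outputs: [time, hidden]"""
    scores = encoder_outputs @ decoder_state      # 每个编码时刻的对齐分数,形状 [time]
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax 得到注意力权重
    context = weights @ encoder_outputs           # 加权求和得到上下文向量,形状 [hidden]
    return context, weights

enc = np.random.rand(6, 4)   # 6 个时间步,隐层维度 4
dec = np.random.rand(4)
context, weights = dot_product_attention(dec, enc)
print(weights.round(3), context.shape)   # 权重之和为 1,context 形状为 (4,)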

另外还有一种Bucket机制。比如现在有很多组对话,长度从0到100字符不等,正常情况下应该把所有句子补全到相同长度再训练,但这样会显著增加计算量。
Bucket机制则是先把所有句子按长度分组,例如bucket1收纳长度0~10的句子、bucket2收纳长度10~30的句子、bucket3收纳长度30~100的句子,每个桶内再分别补齐和计算。也就是说,如果语料数据的长度变化幅度比较大,就可以考虑加入Bucket机制(TensorFlow早期的seq2seq翻译教程在训练时默认使用Bucket)。
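下面是按长度分桶的一个简化示意(纯 Python,桶的边界仅为示例,与 TensorFlow 的内部实现无关):

def bucket_by_length(sentences, buckets=(10, 30, 100)):
    """把句子放进第一个容得下它的桶里,超长的句子可以丢弃或单独处理。"""
    grouped = {b: [] for b in buckets}
    for s in sentences:
        for b in buckets:
            if len(s) <= b:
                grouped[b].append(s)
                break
    return grouped

data = ['short', 'a' * 25, 'b' * 80, 'c' * 8]
for bound, group in bucket_by_length(data).items():
    print(bound, [len(s) for s in group])
# 10 [5, 8]
# 30 [25]
# 100 [80]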

实践部分

任务1:

任务1将实现一个基础版的Seq2Seq:输入一个单词(字母序列),模型将返回对字母排序后的“单词”。

基础Seq2Seq主要包含三部分:Encoder、连接两端的中间状态向量、Decoder。

如:将单词中的字母按字典顺序排序:hello --> ehllo
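这个排序目标本身用一行 Python 就能说明(仅作演示):

word = 'hello'
print(''.join(sorted(word)))   # ehllo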

查看TensorFlow版本

from distutils.version import LooseVersion
import tensorflow as tf
from tensorflow.python.layers.core import Dense


# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.1'), 'Please use TensorFlow version 1.1 or newer'
print('TensorFlow Version: {}'.format(tf.__version__))

如果缺少某些包,可以到该网站下载,不过网速可能较慢。http://www.lfd.uci.edu/~gohlke/pythonlibs/#tensorflow

1.数据集加载

import numpy as np
import time
import tensorflow as tf

with open('data/letters_source.txt', 'r', encoding='utf-8') as f:  # 源序列文件,每行一个字母序列
    source_data = f.read()

with open('data/letters_target.txt', 'r', encoding='utf-8') as f:  # 目标序列文件,与源文件逐行对应
    target_data = f.read()
1.1数据预览
print(source_data.split('\n')[:10])
print(target_data.split('\n')[:10])

source输出结果为:
['bsaqq',
'npy',
'lbwuj',
'bqv',
'kial',
'tddam',
'edxpjpg',
'nspv',
'huloz',
'kmclq']

target输出结果为:
['abqqs',
'npy',
'bjluw',
'bqv',
'aikl',
'addmt',
'degjppx',
'npsv',
'hlouz',
'cklmq']

source为源数据,即模型的输入序列;
target为目标数据,即希望模型输出的序列。两者逐行一一对应,训练时会从中留出一个batch作为验证集。

2.数据预处理

这里的数据预处理,是先构造字符与整数编号的映射表,把文本转换为编号序列;真正把编号映射成低维稠密向量(embedding)的操作在模型内部完成。

def extract_character_vocab(data):
    '''
    构造映射表
    '''
    # 构造特殊符号表:<GO>表示解码开始,<EOS>表示解码结束,<UNK>代表词表外的未知字符,<PAD>用于把同一batch内的序列补齐到相同长度
    special_words = ['<PAD>', '<UNK>', '<GO>', '<EOS>']

    set_words = list(set([character for line in data.split('\n') for character in line]))  # 统计不重复的字符,转换为列表,便于之后进行embedding
    # 这里要把四个特殊字符添加进词典
    int_to_vocab = {idx: word for idx, word in enumerate(special_words + set_words)}  # 利用枚举方法做映射,完成数据预处理
    vocab_to_int = {word: idx for idx, word in int_to_vocab.items()}

    return int_to_vocab, vocab_to_int
2.1调用构造好的函数进行数据预处理
# 构造映射表
source_int_to_letter, source_letter_to_int = extract_character_vocab(source_data)
target_int_to_letter, target_letter_to_int = extract_character_vocab(target_data)

# 对字母进行转换
source_int = [[source_letter_to_int.get(letter, source_letter_to_int['<UNK>']) 
               for letter in line] for line in source_data.split('\n')]
target_int = [[target_letter_to_int.get(letter, target_letter_to_int['<UNK>']) 
               for letter in line] + [target_letter_to_int['<EOS>']] for line in target_data.split('\n')] 
2.2查看映射结果
# 查看一下转换结果
print(source_int[:10])
print(target_int[:10])

结果1:
[[17, 9, 12, 11, 11], # bsaqq
[16, 29, 26],
[13, 17, 15, 25, 8],
[17, 11, 4],
[18, 10, 12, 13],
[23, 7, 7, 12, 24],
[27, 7, 6, 29, 8, 29, 5],
[16, 9, 29, 4],
[28, 25, 13, 21, 20],
[18, 24, 22, 13, 11]]
结果2:
[[12, 17, 11, 11, 9, 3], # abqqs,可以看到这里的3代表加入的特殊符号EOS
[16, 29, 26, 3],
[17, 8, 13, 25, 15, 3],
[17, 11, 4, 3],
[12, 10, 18, 13, 3],
[12, 7, 7, 24, 23, 3],
[7, 27, 5, 8, 29, 29, 6, 3],
[16, 29, 9, 4, 3],
[28, 13, 21, 25, 20, 3],
[22, 18, 13, 24, 11, 3]]

3.构建模型

3.1输入层
def get_inputs():
    '''
    模型输入tensor
    '''
    inputs = tf.placeholder(tf.int32, [None, None], name='inputs')  # 用placeholder进行占位,形状不指定根据训练数据变化
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    learning_rate = tf.placeholder(tf.float32, name='learning_rate')  # 同理,这里替学习率进行占位
    
    # 定义target序列最大长度(之后target_sequence_length和source_sequence_length会作为feed_dict的参数)
    target_sequence_length = tf.placeholder(tf.int32, (None,), name='target_sequence_length')
    max_target_sequence_length = tf.reduce_max(target_sequence_length, name='max_target_len')  # 这里计算序列最大长度项,便于之后根据此进行填充 
    source_sequence_length = tf.placeholder(tf.int32, (None,), name='source_sequence_length')
    
    return inputs, targets, learning_rate, target_sequence_length, max_target_sequence_length, source_sequence_length
3.2Encoder端

在Encoder端,我们需要进行两步:

  • 第一步要对我们的输入进行Embedding;
  • 再把Embedding好的向量传给RNN进行处理。
将要使用到的API介绍:

在Embedding中,我们使用tf.contrib.layers.embed_sequence,它会对每个batch执行embedding操作。

  • tf.contrib.layers.embed_sequence:

对序列数据执行embedding操作,输入[batch_size, sequence_length]的tensor,返回[batch_size, sequence_length, embed_dim]的tensor。

features = [[1,2,3],[4,5,6]]

outputs = tf.contrib.layers.embed_sequence(features, vocab_size, embed_dim)

如果embed_dim=4,输出结果为

[
[[0.1,0.2,0.3,0.1],[0.2,0.5,0.7,0.2],[0.1,0.6,0.1,0.2]],
[[0.6,0.2,0.8,0.2],[0.5,0.6,0.9,0.2],[0.3,0.9,0.2,0.2]]
]

  • tf.contrib.rnn.MultiRNNCell:

对RNN单元按序列堆叠。接受参数为一个由RNN cell组成的list。

rnn_size代表一个rnn单元中隐层节点数量,layer_nums代表堆叠的rnn cell个数

  • tf.nn.dynamic_rnn:

构建RNN,接受动态输入序列。返回RNN的输出以及最终状态的tensor。

dynamic_rnn与rnn的区别在于,dynamic_rnn对于不同的batch,可以接收不同的sequence_length。

例如,第一个batch是[batch_size,10],第二个batch是[batch_size,20]。而rnn只能接收定长的sequence_length。
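下面是一个最小的示意(假设在 TensorFlow 1.x 环境中运行),演示 sequence_length 参数的作用:超过各自真实长度的时间步,dynamic_rnn 会输出 0,并停止更新状态。

import tensorflow as tf
import numpy as np

tf.reset_default_graph()
inputs = tf.placeholder(tf.float32, [None, None, 8])    # [batch, time, feature]
seq_len = tf.placeholder(tf.int32, [None])               # 每条样本的真实长度
cell = tf.contrib.rnn.LSTMCell(16)
outputs, state = tf.nn.dynamic_rnn(cell, inputs, sequence_length=seq_len, dtype=tf.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.random.rand(2, 5, 8).astype(np.float32)   # 两条样本,都补齐到长度 5
    out = sess.run(outputs, {inputs: batch, seq_len: [5, 3]})
    print(out.shape)      # (2, 5, 16)
    print(out[1, 3:])     # 第二条样本第 4、5 步的输出全为 0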

def get_encoder_layer(input_data, rnn_size, num_layers,
                   source_sequence_length, source_vocab_size, 
                   encoding_embedding_size):

    '''
    构造Encoder层,其实也就是一个简单的RNN模型
    
    参数说明:
    - input_data: 输入tensor,输入数据
    - rnn_size: rnn隐层结点数量
    - num_layers: 堆叠的rnn cell数量
    - source_sequence_length: 源数据的序列长度
    - source_vocab_size: 源数据的词典大小,词库大小(不重复的词)
    - encoding_embedding_size: embedding的大小,映射成向量后的维度
    '''
    # Encoder embedding
    encoder_embed_input = tf.contrib.layers.embed_sequence(input_data, source_vocab_size, encoding_embedding_size)

    # RNN cell,以随机初始化的方式构造基本的LSTM单元
    def get_lstm_cell(rnn_size):
        lstm_cell = tf.contrib.rnn.LSTMCell(rnn_size, initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))  
        return lstm_cell

    # 根据基本的LSTM单元,构造多隐层的RNN网络,有几层隐层,就把几层的LSTM单元组合在一起
    cell = tf.contrib.rnn.MultiRNNCell([get_lstm_cell(rnn_size) for _ in range(num_layers)])  
    
    # 构建RNN,接受动态输入序列。返回RNN的输出以及最终状态的tensor
    encoder_output, encoder_state = tf.nn.dynamic_rnn(cell, encoder_embed_input, 
                                                      sequence_length=source_sequence_length, dtype=tf.float32)  # cell是构造好的网络,映射向量,序列长度
    
    return encoder_output, encoder_state
3.3Decoder端

对target数据进行预处理:
预处理是在每条序列开头加入起始符<GO>,并去掉最后一个字符,使解码器的输入与目标序列长度保持一致。


图14 数据预处理后的示意图
def process_decoder_input(data, vocab_to_int, batch_size):
    '''
    补充<GO>,并移除最后一个字符 
    '''
    # cut掉最后一个字符
    ending = tf.strided_slice(data, [0, 0], [batch_size, -1], [1, 1])
    decoder_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)

    return decoder_input
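用一个小批次直观感受这个变换(示意,用普通的 Python 列表代替 tensor;编号沿用前文映射表,2 为 <GO>,3 为 <EOS>,0 为 <PAD>):

batch = [[12, 17, 11, 11, 9, 3],   # 每行是一条补齐后的 target 序列(3 为 <EOS>)
         [16, 29, 26, 3, 0, 0]]    # 0 为 <PAD>

GO = 2                              # 与前文映射表中 <GO> 的编号一致
decoder_input = [[GO] + row[:-1] for row in batch]   # 去掉每行最后一个元素,前面拼上 <GO>
print(decoder_input)
# [[2, 12, 17, 11, 11, 9], [2, 16, 29, 26, 3, 0]]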
3.4对target数据进行embedding

同样地,我们还需要对target数据进行embedding,使得它们能够传入Decoder中的RNN。

将要使用到的API介绍:
  • tf.contrib.seq2seq.TrainingHelper:

Decoder端用来训练的函数。

这个函数不会把t-1阶段的输出作为t阶段的输入,而是把target中的真实值直接输入给RNN。

主要参数是inputs和sequence_length。返回helper对象,可以作为BasicDecoder函数的参数。

  • tf.contrib.seq2seq.GreedyEmbeddingHelper:

它和TrainingHelper的区别在于它会把t-1下的输出进行embedding后再输入给RNN。

下面的图15中代表的是training过程:

在training过程中,我们并不会把每个阶段的预测输出作为下一阶段的输入,而是直接使用target data中的真实值作为下一阶段的输入(即teacher forcing),这样训练更稳定、收敛更快。

图15 Decoder端训练过程
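在看下面完整的 decoding_layer 实现之前,先用一段概念性的纯 Python 示意对比两种喂入方式(step_fn 表示抽象的“单步解码”,并非 TensorFlow API,仅帮助理解):

def decode_train(step_fn, target_tokens, go_token):
    """训练(teacher forcing):每一步都喂入真实的上一个 target token。"""
    outputs, prev = [], go_token
    for true_token in target_tokens:
        outputs.append(step_fn(prev))
        prev = true_token            # 用真实值而不是上一步的预测值
    return outputs

def decode_greedy(step_fn, go_token, eos_token, max_len):
    """预测(greedy):每一步把上一步的预测结果作为当前输入,遇到 <EOS> 停止。"""
    outputs, prev = [], go_token
    for _ in range(max_len):
        pred = step_fn(prev)
        outputs.append(pred)
        if pred == eos_token:
            break
        prev = pred
    return outputs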

def decoding_layer(target_letter_to_int, decoding_embedding_size, num_layers, rnn_size,
                   target_sequence_length, max_target_sequence_length, encoder_state, decoder_input):
    '''
    构造Decoder层
    
    参数:
    - target_letter_to_int: target数据的映射表
    - decoding_embedding_size: embed向量大小
    - num_layers: 堆叠的RNN单元数量
    - rnn_size: RNN单元的隐层结点数量
    - target_sequence_length: target数据序列长度
    - max_target_sequence_length: target数据序列最大长度
    - encoder_state: encoder端编码的状态向量
    - decoder_input: decoder端输入
    '''
    # 1. Embedding
    target_vocab_size = len(target_letter_to_int)  # 计算最终词库的大小
    decoder_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size]))  # 定义embedding矩阵
    decoder_embed_input = tf.nn.embedding_lookup(decoder_embeddings, decoder_input)  # 按编号查表,得到decoder输入对应的词向量

    # 2. 构造Decoder中的RNN单元
    def get_decoder_cell(rnn_size):
        """构造基本的LSTM单元"""
        decoder_cell = tf.contrib.rnn.LSTMCell(rnn_size,
                                               initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
        return decoder_cell

    cell = tf.contrib.rnn.MultiRNNCell([get_decoder_cell(rnn_size) for _ in range(num_layers)])  # 堆叠多层LSTM,构造decoder的RNN网络
     
    # 3. Output全连接层,把RNN每一步的输出映射为词表大小的logits(softmax在计算损失时才应用)
    output_layer = Dense(target_vocab_size,
                         kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))


    # 4. Training decoder,训练decoder,LSTM单元直接用label去做输入
    with tf.variable_scope("decode"):
        # 得到help对象
        training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=decoder_embed_input,
                                                            sequence_length=target_sequence_length,
                                                            time_major=False)
        # 构造基本的decoder
        training_decoder = tf.contrib.seq2seq.BasicDecoder(cell,
                                                           training_helper,
                                                           encoder_state,
                                                           output_layer) 
        # 得到decoder训练后的输出值
        training_decoder_output, _ = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                                       impute_finished=True,
                                                                       maximum_iterations=max_target_sequence_length)

    # 5. Predicting decoder,预测decoder,LSTM单元用前一阶段的输出去做输入
    # 与training共享参数
    with tf.variable_scope("decode", reuse=True):  # 作用域与4相同,reuse=Ture,说明与上一阶段的参数是共享的
        # 创建一个常量tensor并复制为batch_size的大小
        start_tokens = tf.tile(tf.constant([target_letter_to_int['<GO>']], dtype=tf.int32), [batch_size], 
                               name='start_tokens')
        predicting_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(decoder_embeddings,
                                                                start_tokens,
                                                                target_letter_to_int['<EOS>'])
        predicting_decoder = tf.contrib.seq2seq.BasicDecoder(cell,
                                                        predicting_helper,
                                                        encoder_state,
                                                        output_layer)
        predicting_decoder_output, _ = tf.contrib.seq2seq.dynamic_decode(predicting_decoder,
                                                            impute_finished=True,
                                                            maximum_iterations=max_target_sequence_length)
    
    return training_decoder_output, predicting_decoder_output
3.5构建seq2seq模型

上面已经构建完成Encoder和Decoder,下面将这两部分连接起来,构建seq2seq模型

def seq2seq_model(input_data, targets, lr, target_sequence_length, 
                  max_target_sequence_length, source_sequence_length,
                  source_vocab_size, target_vocab_size,
                  encoder_embedding_size, decoder_embedding_size, 
                  rnn_size, num_layers):
    
    # 获取encoder的状态输出
    _, encoder_state = get_encoder_layer(input_data, 
                                  rnn_size, 
                                  num_layers, 
                                  source_sequence_length,
                                  source_vocab_size, 
                                  encoding_embedding_size)
    
    
    # 预处理后的decoder输入
    decoder_input = process_decoder_input(targets, target_letter_to_int, batch_size)
    
    # 将状态向量与输入传递给decoder
    training_decoder_output, predicting_decoder_output = decoding_layer(target_letter_to_int, 
                                                                       decoding_embedding_size, 
                                                                       num_layers, 
                                                                       rnn_size,
                                                                       target_sequence_length,
                                                                       max_target_sequence_length,
                                                                       encoder_state, 
                                                                       decoder_input) 
    
    return training_decoder_output, predicting_decoder_output
    

超参数设置
# 超参数
# Number of Epochs
epochs = 60
# Batch Size
batch_size = 128
# RNN Size
rnn_size = 50
# Number of Layers
num_layers = 2
# Embedding Size
encoding_embedding_size = 15
decoding_embedding_size = 15
# Learning Rate
learning_rate = 0.001
构造graph
# 构造graph
train_graph = tf.Graph()

with train_graph.as_default():
    
    # 获得模型输入    
    input_data, targets, lr, target_sequence_length, max_target_sequence_length, source_sequence_length = get_inputs()
    
    training_decoder_output, predicting_decoder_output = seq2seq_model(input_data, 
                                                                      targets, 
                                                                      lr, 
                                                                      target_sequence_length, 
                                                                      max_target_sequence_length, 
                                                                      source_sequence_length,
                                                                      len(source_letter_to_int),
                                                                      len(target_letter_to_int),
                                                                      encoding_embedding_size, 
                                                                      decoding_embedding_size, 
                                                                      rnn_size, 
                                                                      num_layers)    
    
    training_logits = tf.identity(training_decoder_output.rnn_output, 'logits')
    predicting_logits = tf.identity(predicting_decoder_output.sample_id, name='predictions')
    
    masks = tf.sequence_mask(target_sequence_length, max_target_sequence_length, dtype=tf.float32, name='masks')  # 根据每条target的真实长度生成mask,使<PAD>填充部分不参与损失计算

    with tf.name_scope("optimization"):
        
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(
            training_logits,
            targets,
            masks)

        # Optimizer
        optimizer = tf.train.AdamOptimizer(lr)  # 优化器

        # Gradient Clipping:基于设定的上下限对梯度进行截断,主要用来防止梯度爆炸
        gradients = optimizer.compute_gradients(cost)  # 计算梯度
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]  # 将每个梯度裁剪到[-5, 5]区间
        train_op = optimizer.apply_gradients(capped_gradients)

4.batch批处理

def pad_sentence_batch(sentence_batch, pad_int):
    '''
    对batch中的序列进行补全,保证batch中的每行都有相同的sequence_length
    
    参数:
    - sentence batch
    - pad_int: <PAD>对应索引号
    '''
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [pad_int] * (max_sentence - len(sentence)) for sentence in sentence_batch]
def get_batches(targets, sources, batch_size, source_pad_int, target_pad_int):
    '''
    定义生成器,用来获取batch
    '''
    for batch_i in range(0, len(sources)//batch_size):
        start_i = batch_i * batch_size
        sources_batch = sources[start_i:start_i + batch_size]  # 指定索引符,将数据取出
        targets_batch = targets[start_i:start_i + batch_size]
        # 补全序列
        pad_sources_batch = np.array(pad_sentence_batch(sources_batch, source_pad_int))
        pad_targets_batch = np.array(pad_sentence_batch(targets_batch, target_pad_int))
        
        # 记录每条记录的长度
        pad_targets_lengths = []
        for target in pad_targets_batch:
            pad_targets_lengths.append(len(target))
        
        pad_source_lengths = []
        for source in pad_sources_batch:
            pad_source_lengths.append(len(source))
        
        yield pad_targets_batch, pad_sources_batch, pad_targets_lengths, pad_source_lengths
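pad_sentence_batch 的一个简单调用示例(假设 <PAD> 的编号为 0,仅演示补全逻辑):

print(pad_sentence_batch([[5, 6, 7], [8]], pad_int=0))
# [[5, 6, 7], [8, 0, 0]]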

5.Training训练

# 将数据集分割为train和validation
train_source = source_int[batch_size:]
train_target = target_int[batch_size:]
# 留出一个batch进行验证
valid_source = source_int[:batch_size]
valid_target = target_int[:batch_size]
(valid_targets_batch, valid_sources_batch, valid_targets_lengths, valid_sources_lengths) = next(get_batches(valid_target, valid_source, batch_size,
                           source_letter_to_int['<PAD>'],
                           target_letter_to_int['<PAD>']))

display_step = 50 # 每隔50个batch输出一次loss

checkpoint = "trained_model.ckpt" 
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
        
    for epoch_i in range(1, epochs+1):
        for batch_i, (targets_batch, sources_batch, targets_lengths, sources_lengths) in enumerate(
                get_batches(train_target, train_source, batch_size,
                           source_letter_to_int['<PAD>'],
                           target_letter_to_int['<PAD>'])):
            
            _, loss = sess.run(
                [train_op, cost],
                {input_data: sources_batch,
                 targets: targets_batch,
                 lr: learning_rate,
                 target_sequence_length: targets_lengths,
                 source_sequence_length: sources_lengths})

            if batch_i % display_step == 0:
                
                # 计算validation loss
                validation_loss = sess.run(
                [cost],
                {input_data: valid_sources_batch,
                 targets: valid_targets_batch,
                 lr: learning_rate,
                 target_sequence_length: valid_targets_lengths,
                 source_sequence_length: valid_sources_lengths})
                
                print('Epoch {:>3}/{} Batch {:>4}/{} - Training Loss: {:>6.3f}  - Validation loss: {:>6.3f}'
                      .format(epoch_i,
                              epochs, 
                              batch_i, 
                              len(train_source) // batch_size, 
                              loss, 
                              validation_loss[0]))

    
    
    # 保存模型
    saver = tf.train.Saver()
    saver.save(sess, checkpoint)
    print('Model Trained and Saved')
结果:

Epoch 1/60 Batch 50/77 - Training Loss: 2.332 - Validation loss: 2.091
Epoch 2/60 Batch 50/77 - Training Loss: 1.803 - Validation loss: 1.593
Epoch 3/60 Batch 50/77 - Training Loss: 1.550 - Validation loss: 1.379
Epoch 4/60 Batch 50/77 - Training Loss: 1.343 - Validation loss: 1.184
Epoch 5/60 Batch 50/77 - Training Loss: 1.230 - Validation loss: 1.077
Epoch 6/60 Batch 50/77 - Training Loss: 1.096 - Validation loss: 0.956
Epoch 7/60 Batch 50/77 - Training Loss: 0.993 - Validation loss: 0.849
Epoch 8/60 Batch 50/77 - Training Loss: 0.893 - Validation loss: 0.763
Epoch 9/60 Batch 50/77 - Training Loss: 0.808 - Validation loss: 0.673
Epoch 10/60 Batch 50/77 - Training Loss: 0.728 - Validation loss: 0.600
Epoch 11/60 Batch 50/77 - Training Loss: 0.650 - Validation loss: 0.539
Epoch 12/60 Batch 50/77 - Training Loss: 0.594 - Validation loss: 0.494
Epoch 13/60 Batch 50/77 - Training Loss: 0.560 - Validation loss: 0.455
Epoch 14/60 Batch 50/77 - Training Loss: 0.502 - Validation loss: 0.411
Epoch 15/60 Batch 50/77 - Training Loss: 0.464 - Validation loss: 0.380
Epoch 16/60 Batch 50/77 - Training Loss: 0.428 - Validation loss: 0.352
Epoch 17/60 Batch 50/77 - Training Loss: 0.394 - Validation loss: 0.323
Epoch 18/60 Batch 50/77 - Training Loss: 0.364 - Validation loss: 0.297
Epoch 19/60 Batch 50/77 - Training Loss: 0.335 - Validation loss: 0.270
Epoch 20/60 Batch 50/77 - Training Loss: 0.305 - Validation loss: 0.243
Epoch 21/60 Batch 50/77 - Training Loss: 0.311 - Validation loss: 0.248
Epoch 22/60 Batch 50/77 - Training Loss: 0.253 - Validation loss: 0.203
Epoch 23/60 Batch 50/77 - Training Loss: 0.227 - Validation loss: 0.182
Epoch 24/60 Batch 50/77 - Training Loss: 0.204 - Validation loss: 0.165
Epoch 25/60 Batch 50/77 - Training Loss: 0.184 - Validation loss: 0.150
Epoch 26/60 Batch 50/77 - Training Loss: 0.166 - Validation loss: 0.136
Epoch 27/60 Batch 50/77 - Training Loss: 0.150 - Validation loss: 0.124
Epoch 28/60 Batch 50/77 - Training Loss: 0.135 - Validation loss: 0.113
Epoch 29/60 Batch 50/77 - Training Loss: 0.121 - Validation loss: 0.103
Epoch 30/60 Batch 50/77 - Training Loss: 0.109 - Validation loss: 0.094
Epoch 31/60 Batch 50/77 - Training Loss: 0.098 - Validation loss: 0.086
Epoch 32/60 Batch 50/77 - Training Loss: 0.088 - Validation loss: 0.079
Epoch 33/60 Batch 50/77 - Training Loss: 0.079 - Validation loss: 0.073
Epoch 34/60 Batch 50/77 - Training Loss: 0.071 - Validation loss: 0.067
Epoch 35/60 Batch 50/77 - Training Loss: 0.063 - Validation loss: 0.062
Epoch 36/60 Batch 50/77 - Training Loss: 0.057 - Validation loss: 0.057
Epoch 37/60 Batch 50/77 - Training Loss: 0.052 - Validation loss: 0.053
Epoch 38/60 Batch 50/77 - Training Loss: 0.047 - Validation loss: 0.049
Epoch 39/60 Batch 50/77 - Training Loss: 0.043 - Validation loss: 0.045
Epoch 40/60 Batch 50/77 - Training Loss: 0.039 - Validation loss: 0.042
Epoch 41/60 Batch 50/77 - Training Loss: 0.036 - Validation loss: 0.039
Epoch 42/60 Batch 50/77 - Training Loss: 0.033 - Validation loss: 0.037
Epoch 43/60 Batch 50/77 - Training Loss: 0.030 - Validation loss: 0.034
Epoch 44/60 Batch 50/77 - Training Loss: 0.028 - Validation loss: 0.032
Epoch 45/60 Batch 50/77 - Training Loss: 0.026 - Validation loss: 0.029
Epoch 46/60 Batch 50/77 - Training Loss: 0.024 - Validation loss: 0.028
Epoch 47/60 Batch 50/77 - Training Loss: 0.027 - Validation loss: 0.029
Epoch 48/60 Batch 50/77 - Training Loss: 0.030 - Validation loss: 0.030
Epoch 49/60 Batch 50/77 - Training Loss: 0.023 - Validation loss: 0.026
Epoch 50/60 Batch 50/77 - Training Loss: 0.021 - Validation loss: 0.024
Epoch 51/60 Batch 50/77 - Training Loss: 0.019 - Validation loss: 0.022
Epoch 52/60 Batch 50/77 - Training Loss: 0.017 - Validation loss: 0.021
Epoch 53/60 Batch 50/77 - Training Loss: 0.016 - Validation loss: 0.020
Epoch 54/60 Batch 50/77 - Training Loss: 0.015 - Validation loss: 0.019
Epoch 55/60 Batch 50/77 - Training Loss: 0.014 - Validation loss: 0.018
Epoch 56/60 Batch 50/77 - Training Loss: 0.013 - Validation loss: 0.018
Epoch 57/60 Batch 50/77 - Training Loss: 0.012 - Validation loss: 0.017
Epoch 58/60 Batch 50/77 - Training Loss: 0.011 - Validation loss: 0.016
Epoch 59/60 Batch 50/77 - Training Loss: 0.011 - Validation loss: 0.016
Epoch 60/60 Batch 50/77 - Training Loss: 0.010 - Validation loss: 0.015
Model Trained and Saved

6.Prediction预测

def source_to_seq(text):
    '''
    对源数据进行转换
    '''
    sequence_length = 7
    return [source_letter_to_int.get(word, source_letter_to_int['<UNK>']) for word in text] + [source_letter_to_int['<PAD>']]*(sequence_length-len(text))
# 输入一个单词
input_word = 'common'
text = source_to_seq(input_word)

checkpoint = "./trained_model.ckpt"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # 加载模型
    loader = tf.train.import_meta_graph(checkpoint + '.meta')
    loader.restore(sess, checkpoint)

    input_data = loaded_graph.get_tensor_by_name('inputs:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    source_sequence_length = loaded_graph.get_tensor_by_name('source_sequence_length:0')
    target_sequence_length = loaded_graph.get_tensor_by_name('target_sequence_length:0')
    
    answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                      target_sequence_length: [len(text)]*batch_size, 
                                      source_sequence_length: [len(text)]*batch_size})[0] 


pad = source_letter_to_int["<PAD>"] 

print('原始输入:', input_word)

print('\nSource')
print('  Word 编号:    {}'.format([i for i in text]))
print('  Input Words: {}'.format(" ".join([source_int_to_letter[i] for i in text])))

print('\nTarget')
print('  Word 编号:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format(" ".join([target_int_to_letter[i] for i in answer_logits if i != pad])))
结果展示:

INFO:tensorflow:Restoring parameters from ./trained_model.ckpt
原始输入: common

Source
Word 编号: [20, 28, 6, 6, 28, 5, 0]
Input Words: c o m m o n <PAD>

Target
Word 编号: [20, 6, 6, 5, 28, 28, 3]
Response Words: c m m n o o <EOS>

任务2:文本摘要练习

数据集:Amazon Fine Food Reviews(约56.8万条评论)
分为以下步骤进行:

  • 数据预处理
  • 构建Seq2Seq模型
  • 训练网络
  • 测试效果

seq2seq教程: https://github.com/j-min/tf_tutorial_plus/tree/master/RNN_seq2seq/contrib_seq2seq 国外大神写的Seq2Seq的tutorial

1.导入需要的外部库

import pandas as pd
import numpy as np
import tensorflow as tf
import re
from nltk.corpus import stopwords
import time
from tensorflow.python.layers.core import Dense
from tensorflow.python.ops.rnn_cell_impl import _zero_state_tensors
print('TensorFlow Version: {}'.format(tf.__version__))

2.导入数据

reviews = pd.read_csv("Reviews.csv")
print(reviews.shape)
print(reviews.head())

结果为:
(568454, 10)

Id  ProductId   UserId  ProfileName HelpfulnessNumerator    HelpfulnessDenominator  Score   Time    Summary Text

0 1 B001E4KFG0 A3SGXH7AUHU8GW delmartian 1 1 5 1303862400 Good Quality Dog Food I have bought several of the Vitality canned d...
1 2 B00813GRG4 A1D87F6ZCVE5NK dll pa 0 0 1 1346976000 Not as Advertised Product arrived labeled as Jumbo Salted Peanut...
2 3 B000LQOCH0 ABXLMWJIXXAIN Natalia Corres "Natalia Corres" 1 1 4 1219017600 "Delight" says it all This is a confection that has been around a fe...
3 4 B000UA0QIQ A395BORC6FGVXV Karl 3 3 2 1307923200 Cough Medicine If you are looking for the secret ingredient i...
4 5 B006K2ZZ7K A1UQRSCLF8GW1T Michael D. Bigham "M. Wassir" 0 0 5 1350777600 Great taffy Great taffy at a great price. There was a wid...

2.1检查空数据
# Check for any nulls values
reviews.isnull().sum()
2.2删除空值和不需要的特征
# Remove null values and unneeded features
reviews = reviews.dropna()
reviews = reviews.drop(['Id','ProductId','UserId','ProfileName','HelpfulnessNumerator','HelpfulnessDenominator',
                        'Score','Time'], 1)
reviews = reviews.reset_index(drop=True)

reviews.head()
2.3查看部分数据
# Inspecting some of the reviews
for i in range(5):
    print("Review #",i+1)
    print(reviews.Summary[i])
    print(reviews.Text[i])
    print()

3.数据预处理

主要处理任务:

  • 全部转换成小写
  • 缩写词还原(contractions)
  • 去停用词(只对正文Text去掉,摘要Summary中保留)
3.1设置缩写词列表

contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he's": "he is",
"how'd": "how did",
"how'll": "how will",
"how's": "how is",
"i'd": "i would",
"i'll": "i will",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'll": "it will",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"must've": "must have",
"mustn't": "must not",
"needn't": "need not",
"oughtn't": "ought not",
"shan't": "shall not",
"sha'n't": "shall not",
"she'd": "she would",
"she'll": "she will",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"that'd": "that would",
"that's": "that is",
"there'd": "there had",
"there's": "there is",
"they'd": "they would",
"they'll": "they will",
"they're": "they are",
"they've": "they have",
"wasn't": "was not",
"we'd": "we would",
"we'll": "we will",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"where'd": "where did",
"where's": "where is",
"who'll": "who will",
"who's": "who is",
"won't": "will not",
"wouldn't": "would not",
"you'd": "you would",
"you'll": "you will",
"you're": "you are"
}
3.2数据清洗
def clean_text(text, remove_stopwords = True):
    '''Remove unwanted characters, stopwords, and format the text to create fewer nulls word embeddings'''
    
    # Convert words to lower case
    text = text.lower()
    
    # Replace contractions with their longer forms 
    if True:
        text = text.split()
        new_text = []
        for word in text:
            if word in contractions:
                new_text.append(contractions[word])
            else:
                new_text.append(word)
        text = " ".join(new_text)
    
    # Format words and remove unwanted characters
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\<a href', ' ', text)
    text = re.sub(r'&amp;', '', text) 
    text = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', text)
    text = re.sub(r'<br />', ' ', text)
    text = re.sub(r'\'', ' ', text)
    
    # Optionally, remove stop words
    if remove_stopwords:
        text = text.split()
        stops = set(stopwords.words("english"))
        text = [w for w in text if not w in stops]
        text = " ".join(text)

    return text
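下面是 clean_text 的一个简单调用示例(需要先通过 nltk.download('stopwords') 下载英文停用词表;输出依赖于上面的缩写表和 NLTK 停用词,仅供参考):

# 先确保已下载停用词表:import nltk; nltk.download('stopwords')
sample = "I haven't tried this coffee before, but it's GREAT & cheap!"
print(clean_text(sample, remove_stopwords=True))
# 输出类似:tried coffee great cheap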

我们会删除正文中的停用词,因为它们对训练模型帮助不大;但在摘要中会保留停用词,使生成的摘要读起来更像自然的短语。

# Clean the summaries and texts
clean_summaries = []
for summary in reviews.Summary:
    clean_summaries.append(clean_text(summary, remove_stopwords=False))
print("Summaries are complete.")

clean_texts = []
for text in reviews.Text:
    clean_texts.append(clean_text(text))
print("Texts are complete.")

检查已清理的摘要和文本,确保它们已被清理干净

for i in range(5):
    print("Clean Review #",i+1)
    print(clean_summaries[i])
    print(clean_texts[i])
    print()

计算一组文本中每个单词的出现次数

def count_words(count_dict, text):
    '''Count the number of occurrences of each word in a set of text'''
    for sentence in text:
        for word in sentence.split():
            if word not in count_dict:
                count_dict[word] = 1
            else:
                count_dict[word] += 1
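count_words 的一个简单用法示例(纯 Python,便于理解统计逻辑):

demo_counts = {}
count_words(demo_counts, ["good dog food", "good price"])
print(demo_counts)   # {'good': 2, 'dog': 1, 'food': 1, 'price': 1}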

查找每个单词的使用次数和词汇量的大小

word_counts = {}

count_words(word_counts, clean_summaries)
count_words(word_counts, clean_texts)
            
print("Size of Vocabulary:", len(word_counts))

结果:
Size of Vocabulary: 132884

4.使用构建好的词向量

这里使用别人已经构建好、目前效果较好的预训练词向量(ConceptNet Numberbatch)。

# 加载Conceptnet Numberbatch(CN)嵌入,类似于GloVe,但可能更好
# (https://github.com/commonsense/conceptnet-numberbatch)  这里使用别人已经训练好的词向量ConceptNet
embeddings_index = {}
with open('numberbatch-en-17.04b.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split(' ')
        word = values[0]
        embedding = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = embedding

print('Word embeddings:', len(embeddings_index))

词库总词向量为:
484557

4.1统计语料库中有哪些词不在预训练词向量里,这些词之后需要我们自己随机初始化embedding
# Find the number of words that are missing from CN, and are used more than our threshold.embedding.
missing_words = 0
threshold = 20

for word, count in word_counts.items():
    if count > threshold:
        if word not in embeddings_index:
            missing_words += 1
            
missing_ratio = round(missing_words/len(word_counts),4)*100
            
print("Number of words missing from CN:", missing_words)
print("Percent of words that are missing from vocabulary: {}%".format(missing_ratio))

结果为:
Number of words missing from CN: 3044
Percent of words that are missing from vocabulary: 2.29%

阈值设置为20:出现次数超过20次但不在预训练词向量中的词,后面需要我们自己为它们生成随机的词向量。

4.2将单词转换为整数的字典
# Limit the vocab that we will use to words that appear >= threshold or are in CN

#dictionary to convert words to integers 这里做了将词到int类型的映射,方便在训练和测试的时候,词的转换的操作
vocab_to_int = {} 

value = 0
for word, count in word_counts.items():
    if count >= threshold or word in embeddings_index:
        vocab_to_int[word] = value
        value += 1

# Special tokens that will be added to our vocab
codes = ["<UNK>","<PAD>","<EOS>","<GO>"]   

# Add codes to vocab
for code in codes:
    vocab_to_int[code] = len(vocab_to_int)

# Dictionary to convert integers to words
int_to_vocab = {}
for word, value in vocab_to_int.items():
    int_to_vocab[value] = word

usage_ratio = round(len(vocab_to_int) / len(word_counts),4)*100

print("Total number of unique words:", len(word_counts))
print("Number of words we will use:", len(vocab_to_int))
print("Percent of words we will use: {}%".format(usage_ratio))

结果为:
Total number of unique words: 132884
Number of words we will use: 65469
Percent of words we will use: 49.27%

4.3设置词向量维度
# Need to use 300 for embedding dimensions to match CN's vectors.
embedding_dim = 300  # 因为使用的是别人已经训练好的词向量,且他们设置的词向量的维度是300维,这里指定自己的维度也是300维,确保保持一致
nb_words = len(vocab_to_int)

# Create matrix with default values of zero
word_embedding_matrix = np.zeros((nb_words, embedding_dim), dtype=np.float32)
for word, i in vocab_to_int.items():
    if word in embeddings_index:
        word_embedding_matrix[i] = embeddings_index[word]
    else:
        # If word not in CN, create a random embedding for it
        new_embedding = np.array(np.random.uniform(-1.0, 1.0, embedding_dim))
        embeddings_index[word] = new_embedding
        word_embedding_matrix[i] = new_embedding

# Check if value matches len(vocab_to_int)
print(len(word_embedding_matrix))  # 65469
4.4将文本中的单词转换为整数。
def convert_to_ints(text, word_count, unk_count, eos=False):
    '''Convert words in text to integers. 将文本中的单词转换为整数编号
       If word is not in vocab_to_int, use UNK's integer. 如果word不在vocab_to_int中,用<UNK>的编号代替
       Total the number of words and UNKs. 统计单词总数和UNK总数
       Add EOS token to the end of texts. 可选地在每条文本末尾加上<EOS>'''
    ints = []
    for sentence in text:
        sentence_ints = []
        for word in sentence.split():
            word_count += 1
            if word in vocab_to_int:
                sentence_ints.append(vocab_to_int[word])
            else:
                sentence_ints.append(vocab_to_int["<UNK>"])
                unk_count += 1
        if eos:
            sentence_ints.append(vocab_to_int["<EOS>"])
        ints.append(sentence_ints)
    return ints, word_count, unk_count
4.5将convert_to_ints应用于clean_summaries和clean_texts
# Apply convert_to_ints to clean_summaries and clean_texts
word_count = 0
unk_count = 0

int_summaries, word_count, unk_count = convert_to_ints(clean_summaries, word_count, unk_count)
int_texts, word_count, unk_count = convert_to_ints(clean_texts, word_count, unk_count, eos=True)

unk_percent = round(unk_count/word_count,4)*100

print("Total number of words in headlines:", word_count)
print("Total number of UNKs in headlines:", unk_count)
print("Percent of words that are UNK: {}%".format(unk_percent))

结果为:
Total number of words in headlines: 25679946
Total number of UNKs in headlines: 170450
Percent of words that are UNK: 0.66%

4.6从文本中创建句子长度的DataFrame
def create_lengths(text):  # 因为语料库中词的长度不一致,要做padding,所以这里先统计每个sentence长度
    '''Create a data frame of the sentence lengths from a text'''
    lengths = []
    for sentence in text:
        lengths.append(len(sentence))
    return pd.DataFrame(lengths, columns=['counts'])
lengths_summaries = create_lengths(int_summaries)
lengths_texts = create_lengths(int_texts)

print("Summaries:")
print(lengths_summaries.describe())
print()
print("Texts:")
print(lengths_texts.describe())

结果为:
Summaries:
counts
count 568412.000000
mean 4.181620
std 2.657872
min 0.000000
25% 2.000000
50% 4.000000
75% 5.000000
max 48.000000

Texts:
counts
count 568412.000000
mean 41.996782
std 42.520854
min 1.000000
25% 18.000000
50% 29.000000
75% 50.000000
max 2085.000000

# Inspect the length of texts 统计百分比
print(np.percentile(lengths_texts.counts, 90))
print(np.percentile(lengths_texts.counts, 95))
print(np.percentile(lengths_texts.counts, 99))

84.0
115.0
207.0

# Inspect the length of summaries  检查摘要的长度
print(np.percentile(lengths_summaries.counts, 90))
print(np.percentile(lengths_summaries.counts, 95))
print(np.percentile(lengths_summaries.counts, 99))

8.0
9.0
13.0

4.7计算UNK出现在句子中的次数
def unk_counter(sentence):
    '''Counts the number of times UNK appears in a sentence.'''
    unk_count = 0
    for word in sentence:
        if word == vocab_to_int["<UNK>"]:
            unk_count += 1
    return unk_count
4.8文本排序,设置范围
# Sort the summaries and texts by the length of the texts, shortest to longest  按文本长度对摘要和文本进行排序,最短到最长
# Limit the length of summaries and texts based on the min and max ranges.根据最小和最大范围限制摘要和文本的长度
# Remove reviews that include too many UNKs删除包含太多UNK的评论

sorted_summaries = []
sorted_texts = []
max_text_length = 84
max_summary_length = 13
min_length = 2
unk_text_limit = 1
unk_summary_limit = 0

for length in range(min(lengths_texts.counts), max_text_length): 
    for count, words in enumerate(int_summaries):
        if (len(int_summaries[count]) >= min_length and
            len(int_summaries[count]) <= max_summary_length and
            len(int_texts[count]) >= min_length and
            unk_counter(int_summaries[count]) <= unk_summary_limit and
            unk_counter(int_texts[count]) <= unk_text_limit and
            length == len(int_texts[count])
           ):
            sorted_summaries.append(int_summaries[count])
            sorted_texts.append(int_texts[count])
        
# Compare lengths to ensure they match
print(len(sorted_summaries))
print(len(sorted_texts))

5.构建Seq2Seq模型

这里Encoder使用的是RNN的变种——Bidirectional RNN(双向RNN)。它的改进之处在于:假设当前(第t步)的输出不仅与前面的序列有关,还与后面的序列有关。

例如,预测一个语句中缺失的词语,就需要同时结合上下文来判断。Bidirectional RNN的结构相对简单,由方向相反的两个RNN叠加而成,输出由这两个RNN的隐层状态共同决定。


图 Bidirectional RNN 结构示意
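下面是一个最小的示意(假设 TensorFlow 1.x 环境),说明双向RNN的输出形状:前向、后向各输出 rnn_size 维,拼接后是 2*rnn_size 维,这也是后文 5.2.1 中 tf.concat(enc_output, 2) 的由来。

import tensorflow as tf
import numpy as np

tf.reset_default_graph()
inputs = tf.placeholder(tf.float32, [None, None, 300])   # [batch, time, embed_dim]
cell_fw = tf.contrib.rnn.LSTMCell(256)
cell_bw = tf.contrib.rnn.LSTMCell(256)
outputs, states = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, inputs,
                                                  sequence_length=[5, 3], dtype=tf.float32)
enc_output = tf.concat(outputs, 2)                        # 前向和后向输出在最后一维拼接

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.random.rand(2, 5, 300).astype(np.float32)
    print(sess.run(enc_output, {inputs: batch}).shape)    # (2, 5, 512)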
5.1输入层
5.1.1设置模型输入,为模型的输入创建占位符
def model_inputs():
    '''Create placeholders for inputs to the model 为模型输入创建占位符'''
    
    input_data = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    lr = tf.placeholder(tf.float32, name='learning_rate')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    summary_length = tf.placeholder(tf.int32, (None,), name='summary_length')
    max_summary_length = tf.reduce_max(summary_length, name='max_dec_len')
    text_length = tf.placeholder(tf.int32, (None,), name='text_length')

    return input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length
5.1.2将<GO>插入,便于批处理和训练
def process_encoding_input(target_data, vocab_to_int, batch_size):
    '''Remove the last word id from each batch and concat the <GO> to the begining of each batch
      从每个批次中删除最后一个单词id,并将<GO>连接到每个批次的开头'''
    
    ending = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    dec_input = tf.concat([tf.fill([batch_size, 1], vocab_to_int['<GO>']), ending], 1)

    return dec_input
5.2编码层
5.2.1创建编码层
def encoding_layer(rnn_size, sequence_length, num_layers, rnn_inputs, keep_prob):
    '''Create the encoding layer双向RNN,就是由两个RNN网络组织成的'''
    
    for layer in range(num_layers):
        with tf.variable_scope('encoder_{}'.format(layer)):
            cell_fw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_fw = tf.contrib.rnn.DropoutWrapper(cell_fw, 
                                                    input_keep_prob = keep_prob)

            cell_bw = tf.contrib.rnn.LSTMCell(rnn_size,
                                              initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            cell_bw = tf.contrib.rnn.DropoutWrapper(cell_bw, 
                                                    input_keep_prob = keep_prob)

            enc_output, enc_state = tf.nn.bidirectional_dynamic_rnn(cell_fw, 
                                                                    cell_bw, 
                                                                    rnn_inputs,
                                                                    sequence_length,
                                                                    dtype=tf.float32)
    # Join outputs since we are using a bidirectional RNN
    enc_output = tf.concat(enc_output,2)
    
    return enc_output, enc_state
5.2.2训练解码层
def training_decoding_layer(dec_embed_input, summary_length, dec_cell, initial_state, output_layer, 
                            vocab_size, max_summary_length):
    '''Create the training logits
      logits:未归一化的分数,也就是softmax层的输入'''
    
    training_helper = tf.contrib.seq2seq.TrainingHelper(inputs=dec_embed_input,
                                                        sequence_length=summary_length,
                                                        time_major=False)

    training_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                       training_helper,
                                                       initial_state,
                                                       output_layer) 

    training_logits, _ = tf.contrib.seq2seq.dynamic_decode(training_decoder,
                                                           output_time_major=False,
                                                           impute_finished=True,
                                                           maximum_iterations=max_summary_length)
    return training_logits
5.2.3预测解码层
def inference_decoding_layer(embeddings, start_token, end_token, dec_cell, initial_state, output_layer,
                             max_summary_length, batch_size):
    '''Create the inference logits'''
    
    start_tokens = tf.tile(tf.constant([start_token], dtype=tf.int32), [batch_size], name='start_tokens')
    
    inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embeddings,
                                                                start_tokens,
                                                                end_token)
                
    inference_decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell,
                                                        inference_helper,
                                                        initial_state,
                                                        output_layer)
                
    inference_logits, _ = tf.contrib.seq2seq.dynamic_decode(inference_decoder,
                                                            output_time_major=False,
                                                            impute_finished=True,
                                                            maximum_iterations=max_summary_length)
    
    return inference_logits
5.3解码层
def decoding_layer(dec_embed_input, embeddings, enc_output, enc_state, vocab_size, text_length, summary_length, 
                   max_summary_length, rnn_size, vocab_to_int, keep_prob, batch_size, num_layers):
    '''Create the decoding cell and attention for the training and inference decoding layers
      为训练和预测解码层创建解码单元和注意力机制'''
    
    for layer in range(num_layers):
        with tf.variable_scope('decoder_{}'.format(layer)):
            lstm = tf.contrib.rnn.LSTMCell(rnn_size,
                                           initializer=tf.random_uniform_initializer(-0.1, 0.1, seed=2))
            dec_cell = tf.contrib.rnn.DropoutWrapper(lstm, 
                                                     input_keep_prob = keep_prob)
    
    output_layer = Dense(vocab_size,
                         kernel_initializer = tf.truncated_normal_initializer(mean = 0.0, stddev=0.1))
    
    attn_mech = tf.contrib.seq2seq.BahdanauAttention(rnn_size,
                                                  enc_output,
                                                  text_length,
                                                  normalize=False,
                                                  name='BahdanauAttention')

    # 注意:DynamicAttentionWrapper / DynamicAttentionWrapperState 是 TensorFlow 1.0 的旧接口,
    # 在较新的 TensorFlow 1.x 版本(约 1.2 起)中已更名为 AttentionWrapper / AttentionWrapperState,
    # 初始状态可以直接用 zero_state(...).clone(cell_state=...) 构造
    dec_cell = tf.contrib.seq2seq.AttentionWrapper(dec_cell,
                                                   attn_mech,
                                                   attention_layer_size=rnn_size)

    initial_state = dec_cell.zero_state(batch_size=batch_size,
                                        dtype=tf.float32).clone(cell_state=enc_state[0])
    with tf.variable_scope("decode"):
        training_logits = training_decoding_layer(dec_embed_input, 
                                                  summary_length, 
                                                  dec_cell, 
                                                  initial_state,
                                                  output_layer,
                                                  vocab_size, 
                                                  max_summary_length)
    with tf.variable_scope("decode", reuse=True):
        inference_logits = inference_decoding_layer(embeddings,  
                                                    vocab_to_int['<GO>'], 
                                                    vocab_to_int['<EOS>'],
                                                    dec_cell, 
                                                    initial_state, 
                                                    output_layer,
                                                    max_summary_length,
                                                    batch_size)

    return training_logits, inference_logits
5.4组合Seq2Seq模型
def seq2seq_model(input_data, target_data, keep_prob, text_length, summary_length, max_summary_length, 
                  vocab_size, rnn_size, num_layers, vocab_to_int, batch_size):
    '''Use the previous functions to create the training and inference logits
      使用之前的函数创建训练和预测logits'''
    
    # Use Numberbatch's embeddings and the newly created ones as our embeddings
    embeddings = word_embedding_matrix  # 直接使用前面构造好的词向量矩阵(numpy数组)
    
    enc_embed_input = tf.nn.embedding_lookup(embeddings, input_data)
    enc_output, enc_state = encoding_layer(rnn_size, text_length, num_layers, enc_embed_input, keep_prob)
    
    dec_input = process_encoding_input(target_data, vocab_to_int, batch_size)
    dec_embed_input = tf.nn.embedding_lookup(embeddings, dec_input)
    
    training_logits, inference_logits  = decoding_layer(dec_embed_input, 
                                                        embeddings,
                                                        enc_output,
                                                        enc_state, 
                                                        vocab_size, 
                                                        text_length, 
                                                        summary_length, 
                                                        max_summary_length,
                                                        rnn_size, 
                                                        vocab_to_int, 
                                                        keep_prob, 
                                                        batch_size,
                                                        num_layers)
    
    return training_logits, inference_logits
5.5批处理文本句子
5.5.1填充句子,让句子的长度达到一致
def pad_sentence_batch(sentence_batch):
    """Pad sentences with <PAD> so that each sentence of a batch has the same length
      使用<PAD>填充句子,以便批处理中的每个句子具有相同的长度"""
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [vocab_to_int['<PAD>']] * (max_sentence - len(sentence)) for sentence in sentence_batch]
5.5.2批量处理摘要,文本和句子的长度
def get_batches(summaries, texts, batch_size):
    """Batch summaries, texts, and the lengths of their sentences together"""
    for batch_i in range(0, len(texts)//batch_size):
        start_i = batch_i * batch_size
        summaries_batch = summaries[start_i:start_i + batch_size]
        texts_batch = texts[start_i:start_i + batch_size]
        pad_summaries_batch = np.array(pad_sentence_batch(summaries_batch))
        pad_texts_batch = np.array(pad_sentence_batch(texts_batch))
        
        # Need the lengths for the _lengths parameters
        pad_summaries_lengths = []
        for summary in pad_summaries_batch:
            pad_summaries_lengths.append(len(summary))
        
        pad_texts_lengths = []
        for text in pad_texts_batch:
            pad_texts_lengths.append(len(text))
        
        yield pad_summaries_batch, pad_texts_batch, pad_summaries_lengths, pad_texts_lengths
5.6设置超参数
# Set the Hyperparameters
epochs = 100
batch_size = 64
rnn_size = 256
num_layers = 2
learning_rate = 0.005
keep_probability = 0.75
5.7构建TensorFlow计算图
# Build the graph
train_graph = tf.Graph()
# Set the graph to default to ensure that it is ready for training将图表设置为默认,以确保它已准备好进行训练
with train_graph.as_default():
    
    # Load the model inputs    
    input_data, targets, lr, keep_prob, summary_length, max_summary_length, text_length = model_inputs()

    # Create the training and inference logits
    training_logits, inference_logits = seq2seq_model(tf.reverse(input_data, [-1]),
                                                      targets, 
                                                      keep_prob,   
                                                      text_length,
                                                      summary_length,
                                                      max_summary_length,
                                                      len(vocab_to_int)+1,
                                                      rnn_size, 
                                                      num_layers, 
                                                      vocab_to_int,
                                                      batch_size)
    
    # Create tensors for the training logits and inference logits
    training_logits = tf.identity(training_logits.rnn_output, 'logits')
    inference_logits = tf.identity(inference_logits.sample_id, name='predictions')
    
    # Create the weights for sequence_loss
    masks = tf.sequence_mask(summary_length, max_summary_length, dtype=tf.float32, name='masks')

    with tf.name_scope("optimization"):
        # Loss function
        cost = tf.contrib.seq2seq.sequence_loss(
            training_logits,
            targets,
            masks)

        # Optimizer
        optimizer = tf.train.AdamOptimizer(learning_rate)

        # Gradient Clipping
        gradients = optimizer.compute_gradients(cost)
        capped_gradients = [(tf.clip_by_value(grad, -5., 5.), var) for grad, var in gradients if grad is not None]
        train_op = optimizer.apply_gradients(capped_gradients)
print("Graph is built.")

6.训练网络

6.1训练数据子集
# Subset the data for training
start = 200000
end = start + 50000
sorted_summaries_short = sorted_summaries[start:end]
sorted_texts_short = sorted_texts[start:end]
print("The shortest text length:", len(sorted_texts_short[0]))  # The shortest text length: 25
print("The longest text length:",len(sorted_texts_short[-1]))  # The longest text length: 31
6.2训练模型
# Train the Model
learning_rate_decay = 0.95
min_learning_rate = 0.0005
display_step = 20 # Check training loss after every 20 batches
stop_early = 0 
stop = 3 # If the update loss does not decrease in 3 consecutive update checks, stop training
per_epoch = 3 # Make 3 update checks per epoch
update_check = (len(sorted_texts_short)//batch_size//per_epoch)-1

update_loss = 0 
batch_loss = 0
summary_update_loss = [] # Record the update losses for saving improvements in the model

checkpoint = "best_model.ckpt" 
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
    
    # If we want to continue training a previous session
    #loader = tf.train.import_meta_graph("./" + checkpoint + '.meta')
    #loader.restore(sess, checkpoint)
    
    for epoch_i in range(1, epochs+1):
        update_loss = 0
        batch_loss = 0
        for batch_i, (summaries_batch, texts_batch, summaries_lengths, texts_lengths) in enumerate(
                get_batches(sorted_summaries_short, sorted_texts_short, batch_size)):
            start_time = time.time()
            _, loss = sess.run(
                [train_op, cost],
                {input_data: texts_batch,
                 targets: summaries_batch,
                 lr: learning_rate,
                 summary_length: summaries_lengths,
                 text_length: texts_lengths,
                 keep_prob: keep_probability})

            batch_loss += loss
            update_loss += loss
            end_time = time.time()
            batch_time = end_time - start_time

            if batch_i % display_step == 0 and batch_i > 0:
                print('Epoch {:>3}/{} Batch {:>4}/{} - Loss: {:>6.3f}, Seconds: {:>4.2f}'
                      .format(epoch_i,
                              epochs, 
                              batch_i, 
                              len(sorted_texts_short) // batch_size, 
                              batch_loss / display_step, 
                              batch_time*display_step))
                batch_loss = 0

            if batch_i % update_check == 0 and batch_i > 0:
                print("Average loss for this update:", round(update_loss/update_check,3))
                summary_update_loss.append(update_loss)
                
                # If the update loss is at a new minimum, save the model
                if update_loss <= min(summary_update_loss):
                    print('New Record!') 
                    stop_early = 0
                    saver = tf.train.Saver() 
                    saver.save(sess, checkpoint)

                else:
                    print("No Improvement.")
                    stop_early += 1
                    if stop_early == stop:
                        break
                update_loss = 0
            
                    
        # Reduce learning rate, but not below its minimum value
        learning_rate *= learning_rate_decay
        if learning_rate < min_learning_rate:
            learning_rate = min_learning_rate
        
        if stop_early == stop:
            print("Stopping Training.")
            break

7.测试模型

7.1为模型准备文本语料
def text_to_seq(text):
    '''Prepare the text for the model'''
    text = clean_text(text)
    return [vocab_to_int.get(word, vocab_to_int['<UNK>']) for word in text.split()]
7.2输入语料,进行测试
# Create your own review or use one from the dataset,创建自己的评论或使用数据集中的评论
#input_sentence = "I have never eaten an apple before, but this red one was nice. \
                  #I think that I will try a green apple next time."
#text = text_to_seq(input_sentence)
random = np.random.randint(0,len(clean_texts))
input_sentence = clean_texts[random]
text = text_to_seq(clean_texts[random])

checkpoint = "./best_model.ckpt"

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(checkpoint + '.meta')
    loader.restore(sess, checkpoint)

    input_data = loaded_graph.get_tensor_by_name('input:0')
    logits = loaded_graph.get_tensor_by_name('predictions:0')
    text_length = loaded_graph.get_tensor_by_name('text_length:0')
    summary_length = loaded_graph.get_tensor_by_name('summary_length:0')
    keep_prob = loaded_graph.get_tensor_by_name('keep_prob:0')
    
    #Multiply by batch_size to match the model's input parameters
    answer_logits = sess.run(logits, {input_data: [text]*batch_size, 
                                      summary_length: [np.random.randint(5,8)], 
                                      text_length: [len(text)]*batch_size,
                                      keep_prob: 1.0})[0] 

# Remove the padding from the summary 去掉摘要中的<PAD>
pad = vocab_to_int["<PAD>"] 

print('Original Text:', input_sentence)

print('\nText')
print('  Word Ids:    {}'.format([i for i in text]))
print('  Input Words: {}'.format(" ".join([int_to_vocab[i] for i in text])))

print('\nSummary')
print('  Word Ids:       {}'.format([i for i in answer_logits if i != pad]))
print('  Response Words: {}'.format(" ".join([int_to_vocab[i] for i in answer_logits if i != pad])))

结果为:
INFO:tensorflow:Restoring parameters from ./best_model.ckpt
Original Text: love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets

Text
Word Ids: [70595, 18808, 668, 45565, 51927, 51759, 32488, 13510, 32036, 59599, 11693, 444, 23335, 32036, 59599, 51927, 67316, 726, 24842, 50494, 48492, 1062, 44749, 38443, 42344, 67973, 14168, 7759, 5347, 29528, 58763, 18927, 17701, 20232, 47328]
Input Words: love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets

Summary
Word Ids: [70595, 28738]
Response Words: love it

Examples of reviews and summaries:
  • Review(1): The coffee tasted great and was at such a good price! I highly recommend this to everyone!
  • Summary(1): great coffee
  • Review(2): This is the worst cheese that I have ever bought! I will never buy it again and I hope you won't either!
  • Summary(2): omg gross gross
  • Review(3): love individual oatmeal cups found years ago sam quit selling sound big lots quit selling found target expensive buy individually trilled get entire case time go anywhere need water microwave spoon know quaker flavor packets
  • Summary(3): love it