Building a Classical Chinese Poetry Generation Model with an RNN

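A plain RNN (recurrent neural network) predicts the next time step from its current state and the current input. An LSTM (long short-term memory network) can additionally remember context that lies far from the current position.
Here we use this kind of predictive model to train a generator for classical Chinese poetry.
First we need a dataset: the Complete Tang Poems (全唐诗), 34,646 poems in total. I uploaded the data file to my CSDN account; download it from there if you need it:
http://download.csdn.net/download/qq_34470213/10150761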

Training the model

1. Building the dictionary

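We first read the collection and split it into individual poems stored in a list; the length of that list then tells us how many poems there are.

Each poem has to be read from the file, so we use the open function.

Every line of the data file has the form (title:content). We remove the surrounding whitespace with strip, then call split(':') to separate the title from the content. Only the content is needed here, so that is all we keep.

Note that some titles also contain a ":" character. Such a line splits into more than two parts, and we skip it because the piece we would extract is not the poem body.
What remains is the content of all usable poems.

To mark where a poem starts and ends, we prepend "[" and append "]". During training the model treats these characters as the start and end states.
Every poem is appended to the list, so the list length is the number of poems.

Code: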

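poetrys = []
with open(poetry_file, "r", encoding='utf-8') as f:
    for line in f:
        try:
            # Each line is "title:content"; a line with extra colons raises ValueError here.
            title, content = line.strip().split(':')
            content = content.replace(' ', '')
            # Skip poems that contain annotations or other non-verse symbols.
            if '_' in content or '(' in content or '（' in content or '《' in content or '[' in content:
                continue
            # Skip poems that are too short or too long.
            if len(content) < 5 or len(content) > 79:
                continue
            # Wrap the poem in the start and end markers.
            content = '[' + content + ']'
            poetrys.append(content)
        except ValueError:
            pass

# Sort by length, which keeps the padding inside each batch small.
poetrys = sorted(poetrys, key=lambda line: len(line))
print('Number of Tang poems: ', len(poetrys))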

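With all poem bodies collected, we can encode every character. The poems in encoded form are what we feed to the neural network.

To do that we count how often each character occurs across all poems. collections.Counter walks over a list and returns a dictionary-like object that maps each element to its number of occurrences.

We then pick the characters worth training on. First sort the entries by count from high to low with sorted, where key sets the sort criterion: key=lambda x: x[1] sorts by the second field (the count) in ascending order, and negating it as -x[1] gives descending order.

Next, take the characters to be encoded and number them starting from 0. After sorting we have tuples of (character, count), and we only need the character from each. zip(*...) separates the two fields:
zip([1,2],[3,4],[5,6])
--> (1,3,5), (2,4,6)
zip(*[(1,2),(3,4),(5,6)])
--> (1,3,5), (2,4,6)

The most frequent characters form the encoding dictionary (the code keeps the top 90%): each one is paired with a number from 0 to len(words) - 1.
dict(d): creates a dictionary; d must be a sequence of (key, value) tuples.
The result is a dictionary that maps each character to an integer code starting at 0.

Finally every character of every poem is encoded by looking up its number in the dictionary:
dict.get(key, default=None)
key -- the key to look up in the dictionary.
default -- the value returned when the key is not present.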
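
To make the steps above concrete, here is a small sketch on a toy corpus. The two sample strings and the names `toy_poems`, `vocab`, `char_to_id` and `to_id` are made up for illustration; only `collections.Counter`, `sorted`, `zip`, `dict` and `dict.get` come from the text.

```python
import collections

# Two made-up "poems", already wrapped in the [ ] start/end markers.
toy_poems = ["[ab。]", "[ac。]"]

# Count every character across all poems.
counter = collections.Counter(ch for poem in toy_poems for ch in poem)

# Sort (char, count) pairs by count, highest first; ties keep first-seen order.
count_pairs = sorted(counter.items(), key=lambda pair: -pair[1])

# zip(*pairs) splits the pairs into a tuple of chars and a tuple of counts.
vocab, counts = zip(*count_pairs)

# Number the characters from 0 in frequency order.
char_to_id = dict(zip(vocab, range(len(vocab))))


def to_id(ch):
    """Look up a character; unknown characters fall back to the id len(vocab)."""
    return char_to_id.get(ch, len(vocab))


encoded = [[to_id(ch) for ch in poem] for poem in toy_poems]
print(vocab)
print(encoded)
```

Because the fallback id is `len(vocab)`, the model needs one more output slot than there are dictionary entries, which is why the listings size their variables with `len(words) + 1`.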

Code:

all_words = []
for poetry in poetrys:
    all_words += [word for word in poetry]
counter = collections.Counter(all_words)

count_pairs = sorted(counter.items(), key=lambda x: -x[1])
words, _ = zip(*count_pairs)
leng = int(len(words)*0.9)
words = words[:leng]+(' ',)

word_num_map = dict(zip(words, range(len(words))))
to_num = lambda word: word_num_map.get(word, len(words))

poetrys_vector = [list(map(to_num, poetry)) for poetry in poetrys]

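Training data

Each training step uses 64 poems: we take 64 entries from the list at a time and build the input data x and the target data y from them, where y is the correct answer used for training. (Because the model predicts the next character, y is simply x shifted forward by one character.)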
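
The padding and the one-character shift can be sketched on plain Python lists. This is only an illustration under assumptions: the helper name `make_batch` and the padding id `0` are hypothetical, and the article's own code does the same thing with NumPy arrays, padding with the id of the space character.

```python
def make_batch(poems, pad_id):
    """Pad encoded poems to equal length and build next-character targets."""
    length = max(len(p) for p in poems)
    x = [p + [pad_id] * (length - len(p)) for p in poems]
    # y[t] = x[t + 1]; the last column has no successor, so it repeats x's last value.
    y = [row[1:] + row[-1:] for row in x]
    return x, y


x, y = make_batch([[6, 2, 4, 6, 9], [1, 4, 2]], pad_id=0)
print(x)
print(y)
```

Each row of `y` holds, at position t, the character the model should predict after seeing position t of `x`.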
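Define an RNN model and feed the data into it. Training with an RNN roughly consists of:
1. Define the model and its structure.
2. Initialize the current state to zeros.
3. Convert the input IDs to character vectors (embedding lookup).
4. Run the inputs and the initial state through the model to get its outputs.
5. Add a fully connected layer on top of the outputs to get the final output.
Repeat the training many times to obtain the final state and the final loss. This example runs 50 epochs, and each epoch trains on every batch. With 34,646 poems and a batch size of 64 there are 541 batches (34646 // 64 = 541); the full listing computes the batch count as n_chunk from the poems that survive filtering.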

for epoch in range(50):
    for batche in range(541):
        train(epoch, batche)

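Since the output is the next character, the output layer has one entry per possible character code; in the code its size is len(words) + 1.

To guard against interruptions, the model is saved periodically during training.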
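
As a quick check of which checkpoints that schedule produces, the sketch below lists the file prefixes for the settings used in the full listing (50 epochs, a save when `epoch % 7 == 0`, `global_step=epoch`). The constants `NUM_EPOCHS` and `SAVE_EVERY` are names introduced here for illustration; the `-<step>` suffix is how `tf.train.Saver.save` names a checkpoint when `global_step` is given.

```python
MODEL_SAVE_PATH = "./save/"
MODEL_NAME = "poetry.module"
NUM_EPOCHS = 50   # hypothetical name for the epoch count used in the article
SAVE_EVERY = 7    # hypothetical name for the save interval used in the article

# Epochs at which the condition epoch % 7 == 0 holds.
saved_epochs = [e for e in range(NUM_EPOCHS) if e % SAVE_EVERY == 0]

# Saver appends "-<global_step>" to the path it is given.
checkpoint_names = [f"{MODEL_SAVE_PATH}{MODEL_NAME}-{e}" for e in saved_epochs]
print(checkpoint_names)
```

The generation scripts later restore `poetry.module-49` and `poetry.module-35`, both of which are on this list. Note that a default `tf.train.Saver` keeps only the five most recent checkpoints, so the earliest ones are removed as training proceeds.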

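Generating poems:
We use the trained network to generate new poems. The main steps are:
Read the poem file, count the occurrences of every character, and build the character-to-code dictionary from the counts. It converts between characters and codes.
Build the RNN, which maps input to output. It is written exactly as in the training script.
Load the saved model and use the trained weights to predict new data.
Loop, converting between codes and characters, until a poem is complete.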
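
Turning the model's output back into a character means drawing an index from a probability vector. The sketch below shows one way to do this with a running sum and a binary search. It uses the standard library (`itertools.accumulate`, `bisect`) as a stand-in for the `np.cumsum` and `np.searchsorted` calls in the article's `to_word` helper; the function name `sample_index` and its `rng` parameter are assumptions made for this example.

```python
import bisect
import itertools
import random


def sample_index(weights, rng=random):
    """Draw an index with probability proportional to its weight."""
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    # Pick a point in [0, total) and find the first running sum that exceeds it.
    point = rng.random() * total
    index = bisect.bisect_right(cumulative, point)
    # Guard against a floating-point edge case where point rounds up to total.
    return min(index, len(weights) - 1)


rng = random.Random(0)
draws = [sample_index([1.0, 3.0], rng) for _ in range(10000)]
print(sum(draws) / len(draws))  # fraction of draws that picked index 1
```

Sampling instead of always taking the most likely character is what lets the same model produce a different poem on every run.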

Full training code:

import collections
import numpy as np
from tensorflow.contrib.legacy_seq2seq.python.ops.seq2seq import sequence_loss_by_example
import tensorflow as tf
import os

MODEL_SAVE_PATH = "./save/"
MODEL_NAME = "poetry.module"

# -------------------------------Data preprocessing---------------------------#

poetry_file = 'poetry.txt'

# Poem collection
poetrys = []
with open(poetry_file, "r", encoding='utf-8', ) as f:
    for line in f:
        try:
            title, content = line.strip().split(':')
            content = content.replace(' ', '')
            if '_' in content or '(' in content or '（' in content or '《' in content or '[' in content:
                continue
            if len(content) < 5 or len(content) > 79:
                continue
            content = '[' + content + ']'
            poetrys.append(content)
        except ValueError:
            # The line did not split into exactly one title and one content part.
            pass

poetrys = sorted(poetrys, key=lambda line: len(line))
print('Number of Tang poems: ', len(poetrys))

all_words = []
for poetry in poetrys:
    all_words += [word for word in poetry]
counter = collections.Counter(all_words)
print(counter)
count_pairs = sorted(counter.items(), key=lambda x: -x[1])
print(count_pairs)
words, _ = zip(*count_pairs)
print(words)
print(len(words))
leng = int(len(words)*0.9)

words = words[:leng]+(' ',)
print(words)

word_num_map = dict(zip(words, range(len(words))))

to_num = lambda word: word_num_map.get(word, len(words))
poetrys_vector = [list(map(to_num, poetry)) for poetry in poetrys]
# [[314, 3199, 367, 1556, 26, 179, 680, 0, 3199, 41, 506, 40, 151, 4, 98, 1],
# [339, 3, 133, 31, 302, 653, 512, 0, 37, 148, 294, 25, 54, 833, 3, 1, 965, 1315, 377, 1700, 562, 21, 37, 0, 2, 1253, 21, 36, 264, 877, 809, 1]
# ....]

# Train on 64 poems at a time
batch_size = 64
n_chunk = len(poetrys_vector) // batch_size
x_batches = []
y_batches = []

for i in range(n_chunk):
    start_index = i * batch_size
    end_index = start_index + batch_size

    batches = poetrys_vector[start_index:end_index]
    length = max(map(len, batches))
    xdata = np.full((batch_size, length), word_num_map[' '], np.int32)
    for row in range(batch_size):
        xdata[row, :len(batches[row])] = batches[row]
    ydata = np.copy(xdata)
    ydata[:, :-1] = xdata[:, 1:]
    """
    xdata             ydata
    [6,2,4,6,9]       [2,4,6,9,9]
    [1,4,2,8,5]       [4,2,8,5,5]
    """
    x_batches.append(xdata)
    y_batches.append(ydata)

# ---------------------------------------RNN--------------------------------------#

input_data = tf.placeholder(tf.int32, [batch_size, None])
output_targets = tf.placeholder(tf.int32, [batch_size, None])


# Define the RNN
def neural_network(model='lstm', rnn_size=128, num_layers=2):
    if model == 'rnn':
        cell_fun = tf.nn.rnn_cell.BasicRNNCell
    elif model == 'gru':
        cell_fun = tf.nn.rnn_cell.GRUCell
    elif model == 'lstm':
        cell_fun = tf.nn.rnn_cell.BasicLSTMCell

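    # Build one cell object per layer; sharing a single cell instance across
    # layers raises an error on later TensorFlow 1.x releases.
    cells = [cell_fun(rnn_size) for _ in range(num_layers)]
    cell = tf.nn.rnn_cell.MultiRNNCell(cells, state_is_tuple=True)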

    initial_state = cell.zero_state(batch_size, tf.float32)

    with tf.variable_scope('rnnlm'):
        softmax_w = tf.get_variable("softmax_w", [rnn_size, len(words) + 1])
        softmax_b = tf.get_variable("softmax_b", [len(words) + 1])
        with tf.device("/cpu:0"):
            embedding = tf.get_variable("embedding", [len(words) + 1, rnn_size])
            inputs = tf.nn.embedding_lookup(embedding, input_data)

    outputs, last_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state, scope='rnnlm')
    output = tf.reshape(outputs, [-1, rnn_size])

    logits = tf.matmul(output, softmax_w) + softmax_b
    probs = tf.nn.softmax(logits)
    return logits, last_state, probs, cell, initial_state


# Training
def train_neural_network():
    logits, last_state, _, _, _ = neural_network()
    targets = tf.reshape(output_targets, [-1])
    # Per-character cross-entropy; every position has weight 1.
    loss = sequence_loss_by_example([logits], [targets], [tf.ones_like(targets, dtype=tf.float32)])
    cost = tf.reduce_mean(loss)
    learning_rate = tf.Variable(0.0, trainable=False)
    tvars = tf.trainable_variables()
    # Clip gradients to a global norm of 5.
    grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), 5)
    optimizer = tf.train.AdamOptimizer(learning_rate)
    train_op = optimizer.apply_gradients(zip(grads, tvars))

    # The saver writes the checkpoints that the generation scripts restore later.
    saver = tf.train.Saver()
    os.makedirs(MODEL_SAVE_PATH, exist_ok=True)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(50):
            # Decay the learning rate by 3% per epoch.
            sess.run(tf.assign(learning_rate, 0.002 * (0.97 ** epoch)))
            for batche in range(n_chunk):
                train_loss, _, _ = sess.run([cost, last_state, train_op],
                                            feed_dict={input_data: x_batches[batche], output_targets: y_batches[batche]})
                print(epoch, batche, train_loss)
            # Save once per qualifying epoch (0, 7, ..., 49), after all batches have run.
            if epoch % 7 == 0:
                saver.save(sess, os.path.join(MODEL_SAVE_PATH, MODEL_NAME), global_step=epoch)

train_neural_network()

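After training finishes, the files that store the neural network model are written to ./save/.

Training took a little over ten hours on my laptop. If you would rather not train, you can download my trained files and get the same results.
I put the final training output here, link: https://pan.baidu.com/s/1bIibbo password: ojs3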

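Generating verses with the model

To use the model we first have to load it.
A poem starts with the marker "[", and the initial state is zero. Starting from there with the loaded model, we iterate to produce the whole poem. The end marker is "]"; once the model outputs it, the poem is complete, so we leave the loop and print the result.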
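
The control flow of that loop can be sketched without TensorFlow. Everything model-related here is a hypothetical stand-in: `TRANSITIONS` and `fake_step` replace the trained network with a fixed lookup table, `pick` takes the most likely character where the article samples, and the `max_len` guard is an extra safety limit that the article's code does not have.

```python
# Hypothetical stand-in for the trained network: maps the current character
# to a probability table for the next one.
TRANSITIONS = {
    "[": {"春": 1.0},
    "春": {"眠": 1.0},
    "眠": {"。": 1.0},
    "。": {"]": 1.0},
}


def fake_step(ch, state):
    """Return (next-character probabilities, new state) for one time step."""
    return TRANSITIONS[ch], state + 1


def pick(probs):
    """Choose the most likely character (the article samples instead)."""
    return max(probs, key=probs.get)


def generate(step, max_len=100):
    state = 0                          # zero initial state
    probs, state = step("[", state)    # feed the start marker first
    ch = pick(probs)
    poem = ""
    while ch != "]" and len(poem) < max_len:
        poem += ch
        # The output becomes the next input; the new state is carried forward.
        probs, state = step(ch, state)
        ch = pick(probs)
    return poem


print(generate(fake_step))
```

In the real script the role of `fake_step` is played by `sess.run([probs, last_state], ...)`, with `state_` fed back in through `initial_state` on every iteration.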

import collections
import numpy as np
import tensorflow as tf

#-------------------------------Data preprocessing---------------------------#

poetry_file ='poetry.txt'

# Poem collection
poetrys = []
with open(poetry_file, "r", encoding='utf-8',) as f:
    for line in f:
        try:
            title, content = line.strip().split(':')
            content = content.replace(' ','')
            if '_' in content or '(' in content or '（' in content or '《' in content or '[' in content:
                continue
            if len(content) < 5 or len(content) > 79:
                continue
            content = '[' + content + ']'
            poetrys.append(content)
        except ValueError:
            # The line did not split into exactly one title and one content part.
            pass

poetrys = sorted(poetrys,key=lambda line: len(line))
print('Number of Tang poems: ', len(poetrys))

all_words = []
for poetry in poetrys:
    all_words += [word for word in poetry]
counter = collections.Counter(all_words)
count_pairs = sorted(counter.items(), key=lambda x: -x[1])
words, _ = zip(*count_pairs)

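# Use the same 90% vocabulary cut-off as the training script so that the
# variable shapes match the saved checkpoint.
leng = int(len(words)*0.9)
words = words[:leng] + (' ',)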
word_num_map = dict(zip(words, range(len(words))))
to_num = lambda word: word_num_map.get(word, len(words))
poetrys_vector = [ list(map(to_num, poetry)) for poetry in poetrys]
#[[314, 3199, 367, 1556, 26, 179, 680, 0, 3199, 41, 506, 40, 151, 4, 98, 1],
#[339, 3, 133, 31, 302, 653, 512, 0, 37, 148, 294, 25, 54, 833, 3, 1, 965, 1315, 377, 1700, 562, 21, 37, 0, 2, 1253, 21, 36, 264, 877, 809, 1]
#....]

batch_size = 1
n_chunk = len(poetrys_vector) // batch_size
x_batches = []
y_batches = []
for i in range(n_chunk):
    start_index = i * batch_size
    end_index = start_index + batch_size

    batches = poetrys_vector[start_index:end_index]
    length = max(map(len,batches))
    xdata = np.full((batch_size,length), word_num_map[' '], np.int32)
    for row in range(batch_size):
        xdata[row,:len(batches[row])] = batches[row]
    ydata = np.copy(xdata)
    ydata[:,:-1] = xdata[:,1:]
    """
    xdata             ydata
    [6,2,4,6,9]       [2,4,6,9,9]
    [1,4,2,8,5]       [4,2,8,5,5]
    """
    x_batches.append(xdata)
    y_batches.append(ydata)


#---------------------------------------RNN--------------------------------------#

input_data = tf.placeholder(tf.int32, [batch_size, None])
output_targets = tf.placeholder(tf.int32, [batch_size, None])
# Define the RNN
def neural_network(model='lstm', rnn_size=128, num_layers=2):
    if model == 'rnn':
        cell_fun = tf.nn.rnn_cell.BasicRNNCell
    elif model == 'gru':
        cell_fun = tf.nn.rnn_cell.GRUCell
    elif model == 'lstm':
        cell_fun = tf.nn.rnn_cell.BasicLSTMCell

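    # Build one cell object per layer; sharing a single cell instance across
    # layers raises an error on later TensorFlow 1.x releases.
    cells = [cell_fun(rnn_size) for _ in range(num_layers)]
    cell = tf.nn.rnn_cell.MultiRNNCell(cells, state_is_tuple=True)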

    initial_state = cell.zero_state(batch_size, tf.float32)

    with tf.variable_scope('rnnlm'):
        softmax_w = tf.get_variable("softmax_w", [rnn_size, len(words)+1])
        softmax_b = tf.get_variable("softmax_b", [len(words)+1])
        with tf.device("/cpu:0"):
            embedding = tf.get_variable("embedding", [len(words)+1, rnn_size])
            inputs = tf.nn.embedding_lookup(embedding, input_data)

    outputs, last_state = tf.nn.dynamic_rnn(cell, inputs, initial_state=initial_state, scope='rnnlm')
    output = tf.reshape(outputs,[-1, rnn_size])

    logits = tf.matmul(output, softmax_w) + softmax_b
    probs = tf.nn.softmax(logits)
    return logits, last_state, probs, cell, initial_state

#-------------------------------Generate poems---------------------------------#
# Use the trained model

def gen_poetry():
    def to_word(weights):
        t = np.cumsum(weights)
        s = np.sum(weights)
        sample = int(np.searchsorted(t, np.random.rand(1)*s))
        return words[sample]

    _, last_state, probs, cell, initial_state = neural_network()

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())

        saver = tf.train.Saver(tf.all_variables())
        saver.restore(sess, './save/poetry.module-49')

        state_ = sess.run(cell.zero_state(1, tf.float32))

        x = np.array([list(map(word_num_map.get, '['))])
        [probs_, state_] = sess.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})
        word = to_word(probs_)
        
        poem = ''
        while word != ']':
            # Start a new line after each full stop so every couplet sits on its own line.
            poem += '。\n' if word == '。' else word
            x = np.zeros((1, 1))
            x[0, 0] = word_num_map[word]
            [probs_, state_] = sess.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})
            word = to_word(probs_)

        return poem

print(gen_poetry())

Output: [figure: generated poem]

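Writing acrostic poems

An acrostic differs from free generation in that the first character of every line is prescribed, so each line has to start from the given character. A for loop takes the characters of the acrostic phrase one at a time and generates a line for each.
We first feed "[" to obtain the state state_, then feed the given character together with that state to get the next character. In other words: the current input is the head character, the current state is the state reached after "[", and we compute the output and the next state.
The output becomes the next input and the new state becomes the current state, and we compute the following character.
When the line reaches the required length or ends, we leave the loop and move on to the next head character.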

def gen_poetry_with_head_and_type(head, type):
    if type != 5 and type != 7:
        print('The second para has to be 5 or 7!')
        return

    def to_word(weights):
        t = np.cumsum(weights)
        s = np.sum(weights)
        sample = int(np.searchsorted(t, np.random.rand(1)*s))
        return words[sample]

    _, last_state, probs, cell, initial_state = neural_network()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver()
        saver.restore(sess, './save/poetry.module-35')
        poem = ''

        for the_word in head:
            flag = True
            # Retry until a line of the required length has been produced.
            while flag:
                # Prime the network with the start marker '[' from a zero state.
                state_ = sess.run(cell.zero_state(1, tf.float32))
                x = np.array([list(map(word_num_map.get, '['))])
                [probs_, state_] = sess.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})

                # Feed the prescribed head character and sample the character after it.
                sentence = the_word
                x = np.zeros((1, 1))
                x[0, 0] = word_num_map[sentence]
                [probs_, state_] = sess.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})

                word = to_word(probs_)
                sentence += word

                # Keep sampling until the line ends with '。' or reaches the target length.
                while word != '。':
                    x = np.zeros((1, 1))
                    x[0, 0] = word_num_map[word]
                    [probs_, state_] = sess.run([probs, last_state], feed_dict={input_data: x, initial_state: state_})
                    word = to_word(probs_)

                    sentence += word

                    # A line is two clauses of `type` characters plus two punctuation marks.
                    if len(sentence) == 2 + 2 * type:
                        sentence += '\n'
                        poem += sentence
                        flag = False
                        break

        return poem

print(gen_poetry_with_head_and_type("碧影江白", 7))

Poem output after processing: [figure: generated acrostic poem]

Last edited: 2017.12.09 22:02:34
