Several implementations of word2vec

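Preface

  • Attitude determines altitude! Make excellence a habit!
  • There is nothing in this world that one round of overtime cannot fix; if there is, do a second round! (Maoqiang)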

word2vec

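The well-known word2vec needs no introduction here. If you are unfamiliar with it, look it up on Baidu or Google. What follows walks through several implementations.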

Preparing the corpus

Figure: the corpus

python-gensim

A dead-simple approach; the training itself can even be done in a single line of code.

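  from gensim.models import word2vec
  # Note: this uses the pre-4.0 gensim API. In gensim >= 4.0, `size` is named
  # `vector_size` and word lookups go through `model.wv` (e.g. model.wv["学习"]).
  sentences = word2vec.Text8Corpus("C:/traindataw2v.txt")  # load the corpus
  model = word2vec.Word2Vec(sentences, size=200)  # trains CBOW (default sg=0; pass sg=1 for skip-gram); default window=5
  # Get the word vector for "学习"
  print("学习:", model["学习"])
  # Similarity / relatedness of two words
  y1 = model.similarity("不错", "好")
  # The words most related to a given word
  y2 = model.most_similar("书", topn=20)  # the 20 most related
  # Analogy: 书 is to 不错 as 质量 is to ?
  print("书-不错,质量-")
  y3 = model.most_similar(['质量', '不错'], ['书'], topn=3)
  # Find the word that does not fit in
  y4 = model.doesnt_match("书 书籍 教材 很".split())
  # Save the model for reuse
  model.save("db.model")
  # The matching way to load it
  model = word2vec.Word2Vec.load("db.model")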

That covers the gensim approach.
Next, let's look at the parameters.
The defaults are as follows:

  sentences=None
  size=100
  alpha=0.025
  window=5
  min_count=5
  max_vocab_size=None
  sample=1e-3
  seed=1
  workers=3
  min_alpha=0.0001
  sg=0
  hs=0
  negative=5
  cbow_mean=1
  hashfxn=hash
  iter=5
  null_word=0
  trim_rule=None
  sorted_vocab=1
  batch_words=MAX_WORDS_IN_BATCH

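Surprised that there are so many parameters? Most of them rarely get touched, yet the quality of a trained model depends heavily on them, and they are exposed for a reason. Let's go through what each one means.
sentences: the training sentences, one sentence at a time, each a list of tokens; keep individual sentences from getting too long. In short, it looks like the figure above.
sg: the training algorithm; 0 selects CBOW, 1 selects skip-gram.
size: the dimensionality of the trained word vectors.
window: the maximum distance between the current word and the predicted word within a sentence.
alpha: the initial learning rate, which controls the step size of gradient descent.
seed: seed for the random number generator; it affects how the word vectors are initialized.
min_count: vocabulary cutoff; words that occur fewer than min_count times are discarded.
max_vocab_size: RAM limit while building the vocabulary. If the number of unique words exceeds this value, the least frequent ones are pruned. None means no limit.
sample: threshold for randomly downsampling high-frequency words (this is downsampling, not negative sampling). The default is 1e-3; the useful range is (0, 1e-5).
workers: number of worker threads used for training; on a multi-core machine, more threads train faster.
hs: if 1, hierarchical softmax is used. Hierarchical softmax is an optimization of the output layer: instead of computing probabilities with a full softmax as in the original model, it computes them along a Huffman tree. If 0 (the default) and negative is non-zero, negative sampling is used.
negative: if greater than 0, negative sampling is used, and the value is the number of "noise words" drawn, usually between 5 and 20; the default is 5. A value of 0 disables negative sampling.
cbow_mean: only applies to CBOW. If 0, the sum of the context word vectors is used; if 1 (the default), their mean is used.
hashfxn: hash function used to randomly initialize the weights; defaults to Python's built-in hash.
iter: number of passes over the corpus; the default is 5.
trim_rule: vocabulary trimming rule that decides which words are kept and which are discarded. It can be None (min_count is used) or a callable that accepts (word, count, min_count) and returns utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT. The rule is applied only while the vocabulary is built and is not stored as part of the model.
sorted_vocab: if 1 (the default), words are sorted by descending frequency before word indices are assigned.
batch_words: target number of words in each batch handed to a worker thread; the default is 10000. Anything beyond that is truncated.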
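To make the sample threshold more concrete, here is a minimal sketch of the keep-probability rule behind frequent-word downsampling. It follows the formula in the original word2vec C code as I understand it, and I am assuming gensim behaves the same way. The function name keep_probability and the toy frequencies are made up for illustration.

```python
import math

def keep_probability(word_fraction, sample=1e-3):
    """Probability of keeping one occurrence of a word during training.

    word_fraction: share of all corpus tokens taken by this word (0..1).
    sample: the downsampling threshold (the `sample` parameter).
    Assumption: mirrors the rule in the original word2vec C code.
    """
    if sample <= 0:
        return 1.0  # downsampling disabled
    ratio = sample / word_fraction
    # (sqrt(f / sample) + 1) * (sample / f), capped at 1
    return min(1.0, math.sqrt(ratio) + ratio)

# Hypothetical word frequencies, from very common to rare
for fraction in (0.05, 0.01, 0.001, 0.0001):
    print(fraction, round(keep_probability(fraction), 4))
```

Words rarer than the threshold are always kept, while a word that makes up 5% of the corpus loses most of its occurrences; lowering sample makes the cut more aggressive.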

python-tensorflow

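The example on the official site implements the skip-gram model (not a plain n-gram model); the code below follows it.

Figure: CBOW and skip-gram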

Skip-gram takes an input word and predicts its context, whereas CBOW takes the context and predicts the input word.
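The difference is easiest to see in the training pairs each model consumes. The sketch below is my own illustration, not part of any library: the helper names skipgram_pairs and cbow_pairs and the toy sentence are invented for this example.

```python
def skipgram_pairs(tokens, window):
    """(input word, context word) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    """(context words, target word) pairs: the neighbors jointly predict the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

tokens = ["质量", "不错", "值得", "购买"]  # toy sentence, already segmented
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

Skip-gram produces one training example per (center, neighbor) pair, so it sees more examples per sentence; CBOW produces one example per position, with the whole context as input.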
First, the data is the same as above.

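  • Read the data

    import re

    words = []
    with open("c:/traindatav.txt", "r", encoding="utf-8") as f:
      for line in f:
        # Each line is expected to look like "label => word word word"
        text = line.split(" => ")
        if len(text) == 2:
          label = text[0].strip()  # the label is not needed for training
          # Keep Chinese tokens of at least 2 characters; strip() drops the trailing newline
          listsentence = [w for w in text[1].strip().split(" ") if re.match("[\u4e00-\u9fa5]+", w) and len(w) >= 2]
          words.extend(listsentence)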
    

words holds the tokens, appended in the order in which they appear in the corpus.

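  • Build the dictionary

    import collections

    vocabulary_size = 10000
    def build_dataset(words):
      # Index 0 is reserved for 'UNK'; its count is filled in further down.
      count = [['UNK', -1]]
      count.extend(collections.Counter(words).most_common(vocabulary_size - 1))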
      dictionary = dict()
      for word, _ in count:
        dictionary[word] = len(dictionary)
      data = list()
      unk_count = 0
      for word in words:
        if word in dictionary:
          index = dictionary[word]
        else:
          index = 0  # dictionary['UNK']
          unk_count += 1
        data.append(index)
      count[0][1] = unk_count
      reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
      return data, count, dictionary, reverse_dictionary
    data, count, dictionary, reverse_dictionary = build_dataset(words)
    

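vocabulary_size sets how many words the dictionary keeps; every other word is mapped to UNK.
Words are selected by frequency here. You can refine this if you have a better idea, for example trimming both ends so that words that are too frequent or too rare are dropped. I chose 10,000 words for training.
dictionary holds data like the following:

Figure: dictionary

It assigns an index to every word.

data holds data like the following:

Figure: data

The numbers are the indices of the words in words, in order, looked up in the dictionary above.
count holds the word-frequency statistics:

Figure: count

reverse_dictionary is the inverse of dictionary:

Figure: reverse_dictionary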
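To show what the four structures contain, here is a tiny run on a toy word list. build_toy_dataset is a compact re-implementation written for this illustration (it takes the vocabulary size as an argument instead of reading a global), and the word list is made up.

```python
import collections

def build_toy_dataset(words, vocabulary_size):
    # Index 0 is reserved for UNK; the rest are the most frequent words.
    count = [["UNK", -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = {word: index for index, (word, _) in enumerate(count)}
    # Words outside the dictionary fall back to index 0 (UNK).
    data = [dictionary.get(word, 0) for word in words]
    count[0][1] = data.count(0)
    reverse_dictionary = {index: word for word, index in dictionary.items()}
    return data, count, dictionary, reverse_dictionary

toy_words = ["质量", "不错", "质量", "很好", "不错", "质量", "物流"]
data, count, dictionary, reverse_dictionary = build_toy_dataset(toy_words, vocabulary_size=3)

print(dictionary)          # {'UNK': 0, '质量': 1, '不错': 2}
print(data)                # [1, 2, 1, 0, 2, 1, 0]
print(count)               # [['UNK', 2], ('质量', 3), ('不错', 2)]
print(reverse_dictionary)  # {0: 'UNK', 1: '质量', 2: '不错'}
```

With a vocabulary of 3, only the two most frequent words get their own index; 很好 and 物流 both collapse to UNK, which is why UNK ends up with a count of 2.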
  • Declare the parameters

    import numpy as np

    batch_size = 128
    embedding_size = 128  # Dimension of the embedding vector.
    skip_window = 1       # How many words to consider left and right.
    num_skips = 2         # How many times to reuse an input to generate a label.
    # We pick a random validation set to sample nearest neighbors. Here we limit the
    # validation samples to the words that have a low numeric ID, which by
    # construction are also the most frequent.
    valid_size = 16     # Random set of words to evaluate similarity on.
    valid_window = 100  # Only pick dev samples in the head of the distribution.
    valid_examples = np.random.choice(valid_window, valid_size, replace=False)
    num_sampled = 64    # Number of negative examples to sample.
    
  • Build the batch generator for the skip-gram model

    import collections
    import random

    data_index = 0
    def generate_batch(batch_size, num_skips, skip_window):
      global data_index
      assert batch_size % num_skips == 0
      assert num_skips <= 2 * skip_window
      batch = np.ndarray(shape=(batch_size), dtype=np.int32)
      labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
      span = 2 * skip_window + 1  # [ skip_window target skip_window ]
      buffer = collections.deque(maxlen=span)
      for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
      for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
          while target in targets_to_avoid:
            target = random.randint(0, span - 1)
          targets_to_avoid.append(target)
          batch[i * num_skips + j] = buffer[skip_window]
          labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
      return batch, labels
    

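Here batch = np.ndarray(shape=(batch_size), dtype=np.int32) allocates a vector of length 128, and labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32) allocates a 128x1 matrix. buffer holds the indices of the words in the current context window, taken from data. data_index is a global cursor into data (that is, into the indexed words); each call picks up where the previous one stopped, so the context window slides smoothly over the corpus from one iteration to the next.

Figure: the buffer context window

skip_window is the number of words taken from one side (left or right) of the current input word. num_skips is the number of different words picked from the whole window to serve as output words for that input word.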
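The interplay of the sliding buffer, skip_window and num_skips can be tried out on a handful of indices. This is a simplified sketch of my own, not the tutorial's generate_batch: toy_pairs is a hypothetical helper that makes a single pass without wrap-around and uses random.sample in place of the rejection loop.

```python
import collections
import random

def toy_pairs(data, num_skips, skip_window, seed=0):
    """Return (input, label) index pairs produced by a sliding window.

    Simplified: one pass over `data`, no wrap-around, and random.sample
    instead of the rejection loop used in generate_batch.
    """
    assert num_skips <= 2 * skip_window
    rng = random.Random(seed)
    span = 2 * skip_window + 1            # [ skip_window  center  skip_window ]
    buffer = collections.deque(maxlen=span)
    pairs = []
    for index in data:
        buffer.append(index)              # the oldest item drops out automatically
        if len(buffer) < span:
            continue                      # window not full yet
        context_positions = [p for p in range(span) if p != skip_window]
        # Pick num_skips distinct context positions for this center word
        for p in rng.sample(context_positions, num_skips):
            pairs.append((buffer[skip_window], buffer[p]))
    return pairs

data = [10, 11, 12, 13, 14]               # toy word indices
print(toy_pairs(data, num_skips=2, skip_window=1))
print(toy_pairs(data, num_skips=1, skip_window=2))
```

With skip_window=1 and num_skips=2 every center word is paired with both neighbors. With skip_window=2 and num_skips=1 only one of the four neighbors is sampled per center, so the same corpus yields fewer but more varied pairs.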

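  • Build the computation graph

    # TensorFlow 1.x graph API; under TensorFlow 2.x most of these calls live in tf.compat.v1.
    import math
    import tensorflow as tf

    graph = tf.Graph()
    with graph.as_default():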
    
      # Input data.
      train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
      train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
      valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    
      # Ops and variables pinned to the CPU because of missing GPU implementation
      with tf.device('/cpu:0'):
        # Look up embeddings for inputs.
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)
    
        # Construct the variables for the NCE loss
        nce_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size],stddev=1.0 / math.sqrt(embedding_size)))
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
    
      # Compute the average NCE loss for the batch.
      # tf.nce_loss automatically draws a new sample of the negative labels each
      # time we evaluate the loss.
      loss = tf.reduce_mean(
          tf.nn.nce_loss(weights=nce_weights, biases=nce_biases, inputs=embed, labels=train_labels, num_sampled = num_sampled, num_classes=vocabulary_size))
    
      # Construct the SGD optimizer using a learning rate of 1.0.
      optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
    
      # Compute the cosine similarity between minibatch examples and all embeddings.
      norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
      normalized_embeddings = embeddings / norm
      valid_embeddings = tf.nn.embedding_lookup(
          normalized_embeddings, valid_dataset)
      similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
    
      # Add variable initializer.
      init = tf.global_variables_initializer()
    

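First come the placeholders for the data, train_inputs [128] and train_labels [128x1], followed by valid_dataset, which holds a few relatively frequent words used for validation; its main purpose is to look up the words most similar to them as a sanity check on the trained vectors.
embeddings is the [10000x128] word-vector matrix; embed is the sub-matrix of vectors for the words in the current batch; nce_weights is the weight matrix of the NCE loss; tf.truncated_normal() draws from a truncated normal distribution; nce_biases is the bias term; loss is the loss function, i.e. the training objective; optimizer is stochastic gradient descent with a step size of 1.0, used to minimize loss; similarity computes, from embeddings, the cosine similarity between the words in valid_dataset and every word in the vocabulary.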
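To give a feel for what the sampled loss optimizes, here is a rough sketch for a single (input, label) pair in plain Python. It shows the simpler negative-sampling form of the objective. As far as I know, tf.nn.nce_loss additionally corrects each score by the log of the expected sampling count, which is left out here. The function names and the toy vectors are made up for illustration.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sampled_loss(input_vec, true_weight, true_bias, noise_weights, noise_biases):
    """Binary-classification loss for one (input, label) pair plus sampled noise words."""
    true_logit = dot(input_vec, true_weight) + true_bias
    loss = -log_sigmoid(true_logit)             # the real label should score high
    for weight, bias in zip(noise_weights, noise_biases):
        noise_logit = dot(input_vec, weight) + bias
        loss += -log_sigmoid(-noise_logit)      # sampled noise words should score low
    return loss

embed = [1.0, 0.0]                              # toy embedding of the input word
good = sampled_loss(embed, [5.0, 0.0], 0.0, [[-5.0, 0.0]], [0.0])
bad = sampled_loss(embed, [-5.0, 0.0], 0.0, [[5.0, 0.0]], [0.0])
print(good, bad)
```

The loss is small when the true label's weight vector points the same way as the input embedding and the noise vectors point away from it. Only the sampled rows are touched, which is what keeps each step cheap compared with a full softmax over all 10,000 words.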

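The network looks roughly like the figure below. It is enough to understand the principle; the figure is borrowed as well.

Figure: neural network diagram
  • Run the training iterations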

    num_steps = 100001
    with tf.Session(graph=graph) as session:
      # We must initialize all variables before we use them.
      init.run()
      print("Initialized")
    
      average_loss = 0
      for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
    
        # We perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run()
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val
    
        if step % 2000 == 0:
          if step > 0:
            average_loss /= 2000
          # The average loss is an estimate of the loss over the last 2000 batches.
          print("Average loss at step ", step, ": ", average_loss)
          average_loss = 0
    
        # Note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % 10000 == 0:
          sim = similarity.eval()
          for i in range(valid_size):
            valid_word = reverse_dictionary[valid_examples[i]]
            top_k = 8  # number of nearest neighbors
            nearest = (-sim[i, :]).argsort()[1:top_k + 1]
            log_str = "Nearest to %s:" % valid_word
            for k in range(top_k):
              close_word = reverse_dictionary[nearest[k]]
              log_str = "%s %s," % (log_str, close_word)
            print(log_str)  # print once per validation word, after the inner loop
      final_embeddings = normalized_embeddings.eval()
    

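The training above is actually simple. In each iteration generate_batch produces batch_inputs and batch_labels, which are the data fed to the graph, and then one optimization step runs. Every 2000 steps the average loss is printed, and every 10000 steps the top_k=8 most similar words are computed for each word in valid_dataset. At the end, final_embeddings holds the normalized word vectors.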
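The nearest-neighbor lookup boils down to a few lines. Below is a plain-Python sketch of the same idea on a toy embedding matrix. The helper names are mine, and the real code does this with matrix operations on the full embeddings table.

```python
import math

def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def nearest_neighbors(embeddings, query_index, top_k):
    """Indices of the top_k rows most similar to row `query_index`."""
    unit = [normalize(row) for row in embeddings]
    query = unit[query_index]
    # On unit vectors the dot product equals the cosine similarity.
    sims = [sum(a * b for a, b in zip(query, row)) for row in unit]
    ranked = sorted(range(len(unit)), key=lambda i: -sims[i])
    # Drop the query word itself, like the [1:top_k + 1] slice in the training loop.
    return [i for i in ranked if i != query_index][:top_k]

toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
]
print(nearest_neighbors(toy_embeddings, query_index=0, top_k=2))  # [1, 2]
```

Normalizing first is what lets a single dot product serve as the cosine similarity, so the whole validation set can be compared against the vocabulary in one matrix multiplication.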

  • Finally, the visualization

  def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))  # in inches
    for i, label in enumerate(labels):
      x, y = low_dim_embs[i, :]
      plt.scatter(x, y)
      plt.annotate(label,
             xy=(x, y),
             xytext=(5, 2),
             textcoords='offset points',
             ha='right',
             va='bottom')

    plt.savefig(filename)

  try:
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
    plot_only = 500
    low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
    labels = [reverse_dictionary[i] for i in range(plot_only)]
    plot_with_labels(low_dim_embs, labels)

  except ImportError:
    print("Please install sklearn, matplotlib, and scipy to visualize embeddings.")

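The visualization uses t-SNE, which I won't go into here. If you want the details, see: Data Dimensionality Reduction. I'll leave the rest at that.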

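word2vec on Spark

As for the Spark implementation, I'll go straight to the code. It is simple, and the official site has a detailed tutorial. Personally I find Spark's API about as user-friendly as an API can get. Spark is also heading toward deep learning and real-time streaming, which I personally see as its mainstream path. I'm looking forward to a user-friendly deep learning API.
Enough talk; here is the code.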

  import java.io.{File, PrintWriter}
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

  object WordToVec {
    def main(args :Array[String]): Unit ={
      val conf = new SparkConf().setAppName("WordToVec")
        .setMaster("local")
      val sc = new SparkContext(conf)
      val stopwords = Array("的","是","你","我","他","她","它","和","了","而","有","人","被","做","对","与") //stop words
      val input = sc.textFile("c:/traindataw2v.txt")
        .map(line => line.split(" "))
        .map(_.filter(_.matches("[\u4E00-\u9FA5]+")).toSeq) //keep Chinese tokens only
        .map(_.filter(!stopwords.contains(_)))
        .map(_.filter(_.length >= 2)) //length must be at least 2
      val word2vec = new Word2Vec()
        .setMinCount(2)  //only words occurring at least 2 times enter the vocabulary
        .setWindowSize(5) //context window size of 5
        .setVectorSize(50) //word vectors have 50 dimensions
        .setNumIterations(25) //25 iterations
        .setNumPartitions(3) //3 data partitions
        .setSeed(12345) //seed for the random number generator
      val model = word2vec.fit(input)
  //    model.save(sc, "D:/word2vecTmal")
  //    val model = Word2VecModel.load(sc,"D:/word2vecTmal")
      val word = model.getVectors.keySet
      val writer = new PrintWriter(new File("c:/resultw2v.txt" ))
      model.getVectors.foreach(kv => {
        writer.write(kv._1 + " => " + kv._2.mkString(" ") + "\n")
      })
      writer.close()
      val synonyms = model.findSynonyms("很好", 5) //the 5 words most similar to "很好", printed below
      for((synonym, cosineSimilarity) <- synonyms) {
        println(s"$synonym $cosineSimilarity")
      }
      sc.stop()
    }
  }
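The job above writes one "word => v1 v2 ..." line per word. If you want to use those vectors outside Spark, a small reader like the following should do. It is only a sketch: parse_vectors and cosine are names I made up, and the sample lines stand in for the real result file.

```python
import math

def parse_vectors(lines):
    """Parse `word => v1 v2 ...` lines into a {word: [float, ...]} dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" => ")
        if len(parts) != 2:
            continue  # skip malformed lines
        word, values = parts
        vectors[word] = [float(v) for v in values.split()]
    return vectors

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Stand-in for the lines of the result file
sample_lines = [
    "很好 => 0.5 0.5\n",
    "不错 => 0.4 0.6\n",
    "broken line\n",
]
vectors = parse_vectors(sample_lines)
print(cosine(vectors["很好"], vectors["不错"]))
```

For the real file, pass an open file object instead of the list. Which encoding to use depends on the JVM's default charset at the time the file was written, so check that before reading.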

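Summary

My personal advice: to train word2vec on a single machine, the first option (gensim) is the best choice; on a cluster, or when the data volume is large, use distributed training with Spark. In my experience both give more reliable results than the official TensorFlow example, which is a direct consequence of how the TensorFlow example is implemented.
That's all from me. Try it yourself; after all, my word isn't final, and practice is the best teacher. I'll keep writing about related algorithms, so stay tuned. All substance, no filler.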
