DIN (Deep Interest Network): Core Ideas and Annotated Source Code

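Paper: https://arxiv.org/abs/1706.06978
DIN is one of the more representative recent papers on CTR prediction for ad recommendation, and its attention mechanism offers a new way to use sequence features. This post analyzes the core ideas of DIN. Because the variable names in the DIN source code are rather arbitrary, it also provides annotations for parts of the source code, for reference only.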

Paper analysis

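Core idea: user interests are diverse (diversity), and for a specific ad, the user's different interests carry different influence (local activation).
Here is an example from the DIN paper: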

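A young mother clicks on an ad for a new handbag on Taobao, which suggests she is interested in that ad. We want to know what led to the click so that similar ads can be shown to her in the future. In general, user behavior features (such as the goods a user recently browsed, clicked, or purchased) are decisive. User behavior features usually refer to one user's sequence of behaviors over multiple goods. The length and content of this sequence vary greatly from user to user (this is diversity). Suppose this mother's sequence contains a new tote bag and a leather handbag that she recently browsed, along with pots and pans. Compared with the pots and pans, the first two items clearly matter more for her click on the new handbag ad (this is local activation). For a different ad, say a kitchen cleaner, the pots and pans in her behavior sequence become the decisive factor in predicting whether she will click (that is, in CTR prediction). In other words, when predicting a user's CTR for an ad, the items in the user behavior sequence should not all be treated equally; the prediction should combine the content of the target ad with the user behavior sequence.

Problems with existing Embedding+MLP models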

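Existing CTR prediction models, such as FNN, Wide&Deep, Deep Crossing, and PNN, share the same basic structure: they use FM or a similar method to build embeddings that turn large-scale sparse Web data into dense vectors. Because the user behavior sequence in the input is usually multi-hot encoded, the number of embedding vectors differs from user to user. These vectors are therefore passed through a pooling layer (sum pooling or mean pooling) to obtain a fixed-length vector, which is then fed into the MLP for training. The problem is that every user is represented by one fixed-length vector that stays the same whatever the candidate ad is, so the relationship between the user's behavior sequence and the target ad is ignored. The base model in the DIN paper is exactly such a model, as shown in Figure 1: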

Figure 1. Base model
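
To make the limitation concrete, here is a minimal NumPy sketch of the pooling step in such a base model. The embedding table, the item ids, and the helper name `sum_pool_user_vector` are all made up for illustration and are not taken from the paper or its code. The point is only that the pooled vector has a fixed length and never sees the candidate ad.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 4
# Hypothetical embedding table for 10 items (random values, illustration only).
item_emb = rng.normal(size=(10, emb_dim))

def sum_pool_user_vector(behavior_ids):
    """Sum-pool the embeddings of a user's behavior sequence into one vector."""
    return item_emb[behavior_ids].sum(axis=0)

# Two users whose behavior sequences have different lengths.
user_a = sum_pool_user_vector([1, 2, 3])
user_b = sum_pool_user_vector([4, 5, 6, 7, 8])

# Both vectors have the same fixed length, whatever the sequence length was.
print(user_a.shape, user_b.shape)

# The candidate ad never enters the computation, so each user's vector is
# identical no matter which ad is being scored.
```

Mean pooling would only divide by the sequence length; the candidate ad would still be absent from the computation.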

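So how can the two be combined? The DIN model adaptively takes the relationship between the user's historical behaviors and the candidate ad into account when computing the user interest vector. (From the paper: "Instead of expressing all user's diverse interests with the same vector, DIN adaptively calculates the representation vector of user interests by taking into consideration the relevance of historical behaviors w.r.t. candidate ad.") This user interest vector differs from one candidate ad to another.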

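As for the network structure, DIN introduces a local activation unit that computes the relationship between the user behavior features and the candidate ad, as shown in Figure 2: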

Figure 2. DIN model

For a candidate ad, the user interest vector computed with the local activation unit is:
$\boldsymbol v_U(A)=f(\boldsymbol v_A,\boldsymbol e_1,\boldsymbol e_2,\dots,\boldsymbol e_H)=\sum_{j=1}^H a(\boldsymbol e_j,\boldsymbol v_A)\,\boldsymbol e_j=\sum_{j=1}^H w_j\boldsymbol e_j \tag{1}$

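Here $\{\boldsymbol e_1,\boldsymbol e_2,\dots,\boldsymbol e_H\}$ are the embedding vectors of the behavior sequence of user $U$, where $H$ is the sequence length, and $\boldsymbol v_A$ is the embedding vector of ad $A$. Computed this way, the final interest vector of user $U$ changes with the ad $A$. $a(\cdot)$ denotes a feed-forward network whose output serves as the local activation weight $w_j$; each weight multiplies the corresponding behavior embedding $\boldsymbol e_j$, as annotated in Figure 2.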
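
The weighted sum in Eq. (1) can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions: the activation unit here is a tiny one-hidden-layer network on the concatenation of $\boldsymbol e_j$ and $\boldsymbol v_A$, with random weights `W1` and `W2`, which is simpler than the unit used in the paper and in the released code. The names `activation_weights` and `user_interest` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 4

# Hypothetical parameters of a tiny activation unit a(e_j, v_A):
# one hidden layer on the concatenation [e_j, v_A], then a scalar output.
W1 = rng.normal(size=(2 * emb_dim, 8))
W2 = rng.normal(size=(8, 1))

def activation_weights(behaviors, ad):
    """Return one unnormalized weight w_j per behavior embedding e_j."""
    ad_tiled = np.broadcast_to(ad, behaviors.shape)       # [H, D]
    x = np.concatenate([behaviors, ad_tiled], axis=-1)    # [H, 2D]
    hidden = np.tanh(x @ W1)                              # [H, 8]
    return (hidden @ W2)[:, 0]                            # [H]

def user_interest(behaviors, ad):
    """v_U(A) = sum_j w_j * e_j, following Eq. (1)."""
    w = activation_weights(behaviors, ad)
    return w @ behaviors                                  # [D]

behaviors = rng.normal(size=(5, emb_dim))   # e_1..e_H with H = 5
ad_1 = rng.normal(size=emb_dim)
ad_2 = rng.normal(size=emb_dim)

v_1 = user_interest(behaviors, ad_1)
v_2 = user_interest(behaviors, ad_2)

# Same user, same behaviors, but a different vector for each candidate ad.
print(v_1.shape, np.allclose(v_1, v_2))
```

Unlike sum pooling, the same behavior sequence now yields a different user vector for each candidate ad, because the ad enters through the weights.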

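Local activation borrows the attention mechanism from NMT (Neural Machine Translation). The difference is that a traditional attention layer normalizes its weights, so the weights in Eq. (1) would have to satisfy $\sum_j w_j=1$. DIN drops this constraint: the value of $\sum_j w_j$ is treated as an approximation of the intensity of the user's interest. The sum can therefore take different values for different ads $A$, which better reflects what local activation means.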
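
A small numeric sketch of this difference, with made-up activation scores (the numbers and variable names are my own, not from the paper): after softmax the weights always sum to 1, while the raw outputs keep a sum that depends on the ad.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Hypothetical raw activation-unit outputs for one user's 3 behaviors,
# scored against two different candidate ads.
scores_bag_ad = np.array([2.0, 1.5, 0.1])     # history strongly related to the ad
scores_other_ad = np.array([0.2, 0.1, 0.1])   # history weakly related to the ad

# Softmax-normalized weights always sum to 1, so the overall
# interest intensity is lost.
print(softmax(scores_bag_ad).sum(), softmax(scores_other_ad).sum())

# Using the raw outputs as weights keeps the intensity: the sums differ.
print(scores_bag_ad.sum(), scores_other_ad.sum())
```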

Annotated source code (partial)

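Open-source code (focus on /din/model.py): https://github.com/zhougr1993/DeepInterestNetwork/blob/master/din/model.py
The variable names in DIN's released source code are very hard to follow (i, j, h, y, and so on carry no obvious meaning). Based on my own experiments and some reasoning (read: guesswork), I have put together annotations for part of the code. These reflect my personal understanding and are for reference only.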

First, the beginning of /din/model.py:

import tensorflow as tf
from Dice import dice
class Model(object):
  def __init__(self, user_count, item_count, cate_count, cate_list,\
                               predict_batch_size, predict_ads_num):
    self.u = tf.placeholder(tf.int32, [None,])
    # shape: [B], user id (B: batch size)
    self.i = tf.placeholder(tf.int32, [None,])
    # shape: [B], i: candidate (target) item, treated as the positive sample when paired with j
    self.j = tf.placeholder(tf.int32, [None,])
    # shape: [B], j: item of the negative sample
    self.y = tf.placeholder(tf.float32, [None,])
    # shape: [B], y: label
    self.hist_i = tf.placeholder(tf.int32, [None, None])
    # shape: [B, T], item sequence of the user behavior features. T is the (padded) sequence length
    self.sl = tf.placeholder(tf.int32, [None,])
    # shape: [B], sl: sequence length, the true (unpadded) length of each user behavior sequence
    self.lr = tf.placeholder(tf.float64, [])
    # learning rate
    hidden_units = 128
    user_emb_w = tf.get_variable("user_emb_w", [user_count, hidden_units])
    # shape: [U, H], embedding weights for user_id. U = user_count, H = hidden_units

    item_emb_w = tf.get_variable("item_emb_w", [item_count, hidden_units // 2])
    # shape: [I, H//2], embedding weights for item_id. I = item_count

    item_b = tf.get_variable("item_b", [item_count],
                             initializer=tf.constant_initializer(0.0))
    # shape: [I], per-item bias
    cate_emb_w = tf.get_variable("cate_emb_w", [cate_count, hidden_units // 2])
    # shape: [C, H//2], embedding weights for cate_id. C = cate_count

    cate_list = tf.convert_to_tensor(cate_list, dtype=tf.int64)
    # shape: [I], indexed by item id; entry k is the category id of item k

    ic = tf.gather(cate_list, self.i)
    # look up the category of each candidate (positive) item in cate_list
    i_emb = tf.concat(values = [
        tf.nn.embedding_lookup(item_emb_w, self.i),
        tf.nn.embedding_lookup(cate_emb_w, ic),
        ], axis=1)
    # embedding of the positive sample: item embedding concatenated with cate embedding, [B, H]

    i_b = tf.gather(item_b, self.i)

    jc = tf.gather(cate_list, self.j)
    # look up the category of each negative item in cate_list
    j_emb = tf.concat([
        tf.nn.embedding_lookup(item_emb_w, self.j),
        tf.nn.embedding_lookup(cate_emb_w, jc),
        ], axis=1)
    # embedding of the negative sample: item embedding concatenated with cate embedding, [B, H]

    j_b = tf.gather(item_b, self.j) # bias b
    hc = tf.gather(cate_list, self.hist_i)
    # cate sequence of the user behavior sequence, [B, T]

    h_emb = tf.concat([tf.nn.embedding_lookup(item_emb_w, self.hist_i),
        tf.nn.embedding_lookup(cate_emb_w, hc),
        ], axis=2)
    # embedding of the user behavior sequence: item sequence and cate sequence concatenated, [B, T, H]
    hist_i = attention(i_emb, h_emb, self.sl) # attention over the behavior sequence
    #-- attention end ---
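
The shape bookkeeping above is easier to follow on a toy example. The following NumPy sketch mirrors the lookup-and-concatenate steps with made-up sizes and ids; NumPy fancy indexing stands in for `tf.gather` and `tf.nn.embedding_lookup`, and random arrays stand in for the trainable tables. It is an illustration of the shapes, not the original code.

```python
import numpy as np

rng = np.random.default_rng(0)
item_count, cate_count, hidden_units = 6, 3, 8

# Toy stand-ins for the trainable tables in the model.
item_emb_w = rng.normal(size=(item_count, hidden_units // 2))   # [I, H//2]
cate_emb_w = rng.normal(size=(cate_count, hidden_units // 2))   # [C, H//2]
# cate_list[k] is the category id of item k (hypothetical values).
cate_list = np.array([0, 0, 1, 1, 2, 2])                        # [I]

i = np.array([3, 5])            # [B] candidate items, B = 2
hist_i = np.array([[0, 1, 2],   # [B, T] padded behavior sequences, T = 3
                   [4, 0, 0]])

# Candidate item: item embedding and category embedding side by side.
ic = cate_list[i]                                                # [B]
i_emb = np.concatenate([item_emb_w[i], cate_emb_w[ic]], axis=1)  # [B, H]

# Behavior sequence: the same lookup, with one extra time axis.
hc = cate_list[hist_i]                                           # [B, T]
h_emb = np.concatenate([item_emb_w[hist_i], cate_emb_w[hc]], axis=2)  # [B, T, H]

print(i_emb.shape, h_emb.shape)
```

Each half of the concatenation has width `hidden_units // 2`, which is why the combined item-plus-category embedding ends up with width `hidden_units`.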

Next, jump to Line 199 of /din/model.py to look at the attention part:

def attention(queries, keys, keys_length):
  '''
    queries:     shape: [B, H], i.e. i_emb
    keys:        shape: [B, T, H], i.e. h_emb
    keys_length: shape: [B], i.e. self.sl
    B: batch size; T: length of the (padded) user sequence; H: embedding size
  '''
  queries_hidden_units = queries.get_shape().as_list()[-1]
  # a Python int equal to H (the embedding size)
  queries = tf.tile(queries, [1, tf.shape(keys)[1]])
  # [B, H] -> [B, T*H]: each query is repeated T times along the last axis
  queries = tf.reshape(queries, [-1, tf.shape(keys)[1], queries_hidden_units])
  # [B, T*H] -> [B, T, H]
  din_all = tf.concat([queries, keys, queries-keys, queries*keys], axis=-1)
  # input features of the attention network, shape [B, T, 4*H]
  d_layer_1_all = tf.layers.dense(din_all, 80, activation=tf.nn.sigmoid, \
                              name='f1_att', reuse=tf.AUTO_REUSE) # [B, T, 80]
  d_layer_2_all = tf.layers.dense(d_layer_1_all, 40, activation=tf.nn.sigmoid, \
                              name='f2_att', reuse=tf.AUTO_REUSE) # [B, T, 40]
  d_layer_3_all = tf.layers.dense(d_layer_2_all, 1, activation=None, \
                              name='f3_att', reuse=tf.AUTO_REUSE) # [B, T, 1]
  d_layer_3_all = tf.reshape(d_layer_3_all, [-1, 1, tf.shape(keys)[1]]) # [B, 1, T]
  outputs = d_layer_3_all # raw attention scores, [B, 1, T]

  # Mask
  key_masks = tf.sequence_mask(keys_length, tf.shape(keys)[1])   # [B, T]
  key_masks = tf.expand_dims(key_masks, 1) # [B, 1, T]
  paddings = tf.ones_like(outputs) * (-2 ** 32 + 1)
  # padded positions get a very large negative score, so their softmax weight is close to 0
  outputs = tf.where(key_masks, outputs, paddings)
  # [B, 1, T]. The scores of the empty (padded) slots in each sequence are set to (-2 ** 32 + 1)

  # Scale
  outputs = outputs / (keys.get_shape().as_list()[-1] ** 0.5)

  # Activation
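  outputs = tf.nn.softmax(outputs)
  # [B, 1, T]. These are the attention weights, i.e. the w in Eq. (3) of the paper.
  # Note that this softmax normalizes them to sum to 1, unlike the unnormalized weights described in the paper.

  # Weighted sum
  outputs = tf.matmul(outputs, keys)
  # [B, 1, H]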

  return outputs
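
To check my reading of the mask, scale, softmax, and weighted-sum steps, here is a NumPy sketch of the same pipeline. It is an assumption-laden stand-in, not the original code: the raw scores are plain dot products instead of the three dense layers (`f1_att`, `f2_att`, `f3_att`), and the function name `masked_attention` is my own.

```python
import numpy as np

def masked_attention(queries, keys, keys_length):
    """Sketch of the mask / scale / softmax / weighted-sum steps.

    queries: [B, H], keys: [B, T, H], keys_length: [B].
    Raw scores are dot products, standing in for the dense layers
    of the original attention network.
    """
    B, T, H = keys.shape
    scores = np.einsum('bh,bth->bt', queries, keys)[:, None, :]       # [B, 1, T]

    # Mask: True for real positions, False for padding.
    mask = (np.arange(T)[None, :] < keys_length[:, None])[:, None, :]  # [B, 1, T]
    scores = np.where(mask, scores, -2.0 ** 32 + 1)

    # Scale, then softmax over the time axis.
    scores = scores / np.sqrt(H)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)           # [B, 1, T]

    # Weighted sum of the keys.
    return weights, weights @ keys                                    # [B, 1, H]

rng = np.random.default_rng(0)
queries = rng.normal(size=(2, 4))
keys = rng.normal(size=(2, 3, 4))
keys_length = np.array([3, 1])   # the second sequence has two padded slots

weights, outputs = masked_attention(queries, keys, keys_length)
print(weights[1, 0], outputs.shape)
```

In the second sequence only the first position is real, so the padded slots receive essentially zero weight and the output for that row reduces to the first key.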

If you have questions or corrections, feel free to leave them in the comments!