# [Word2Vec Series][EMNLP 2014] GloVe

https://pdfs.semanticscholar.org/b397/ed9a08ca46566aa8c35be51e6b466643e5fb.pdf

# Intro

The two main model families for learning word vectors are: 1) global matrix factorization methods, such as latent semantic analysis (LSA) (Deerwester et al., 1990) and 2) local context window methods, such as the skip-gram model of Mikolov et al. (2013c).

While methods like LSA efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.

GloVe combines, to some extent, the strengths of both families.

# Related

In the skip-gram and ivLBL models, the objective is to predict a word’s context given the word itself, whereas the objective in the CBOW and vLBL models is to predict a word given its context.
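In other words, Word2Vec has two algorithms: one predicts the context from the center word, the other predicts the center word from its context.
These shallow window-based methods suffer from the disadvantage that they do not operate directly on the co-occurrence statistics of the corpus.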

# Model

X: the word-word co-occurrence matrix
Xij: the number of times word j occurs in the context of word i
Xi: the number of times any word appears in the context of word i, i.e. the sum of Xik over all k
Pij = P(j|i) = Xij/Xi: the probability that word j appears in the context of word i
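
To make these definitions concrete, here is a minimal sketch that counts co-occurrences in a toy sentence and computes Pij. The function names, the window size of 2, and the example sentence are my own assumptions for illustration. Every context word counts as 1 regardless of distance, which is a simplification.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Return X where X[i][j] counts how often word j appears
    within `window` positions of word i (uniform weighting, an assumption)."""
    X = defaultdict(lambda: defaultdict(float))
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                X[word][tokens[ctx_pos]] += 1.0
    return X

def cooccurrence_prob(X, i, j):
    """P_ij = X_ij / X_i, with X_i the sum of X_ik over all context words k."""
    X_i = sum(X[i].values())
    return X[i].get(j, 0.0) / X_i if X_i > 0 else 0.0

# Toy corpus, made up for illustration.
tokens = "ice is solid and steam is gas".split()
X = cooccurrence_counts(tokens, window=2)
print(cooccurrence_prob(X, "ice", "solid"))
```

A nested dictionary is used because real co-occurrence matrices are sparse; a dense V x V array would mostly hold zeros.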

Compared to the raw probabilities, the ratio Pik/Pjk is better able to distinguish relevant words from irrelevant words and it is also better able to discriminate between the two relevant words.

For example, with i = ice and j = steam, the probe word k = solid is related to ice but not to steam, so Pik/Pjk is large.
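
The behaviour of the ratio can be checked with a small sketch. The counts below are hypothetical numbers invented for illustration, not figures from the paper; only the qualitative pattern matters (large for a word related to ice only, small for a word related to steam only, near 1 for words related to both or neither).

```python
# Hypothetical co-occurrence counts, made up for illustration only.
counts = {
    "ice":   {"solid": 80, "gas": 5,  "water": 300, "fashion": 2},
    "steam": {"solid": 6,  "gas": 70, "water": 280, "fashion": 2},
}

def prob(i, k):
    """P_ik = X_ik / X_i for the toy counts above."""
    return counts[i][k] / sum(counts[i].values())

def ratio(i, j, k):
    """P_ik / P_jk: how much more likely k is near i than near j."""
    return prob(i, k) / prob(j, k)

for k in ("solid", "gas", "water", "fashion"):
    print(f"{k:8s} {ratio('ice', 'steam', k):.2f}")
```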

Since the ratio Pik/Pjk depends on three words i, j, and k, the most general model takes the form:

![](http://www.forkosh.com/mathtex.cgi? F(w_i, w_j, \widetilde{w}_k)=\frac{P_{ik}}{P_{jk}})

Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences:

![](http://www.forkosh.com/mathtex.cgi? F(w_i- w_j, \widetilde{w}_k)=\frac{P_{ik}}{P_{jk}})

While F could be taken to be a complicated function parameterized by, e.g., a neural network, doing so would obfuscate the linear structure we are trying to capture. To avoid this issue, we can first take the dot product of the arguments:

![](http://www.forkosh.com/mathtex.cgi? F((w_i-w_j)^T\widetilde{w}_k)=\frac{P_{ik}}{P_{jk}})

<font color=gray>I don't quite follow the manipulation here.</font>

<font color=gray>I don't quite see why this is sufficient.</font>
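
One way to read this step, following the argument in the paper as I understand it: which word is the "word" and which is the "context word" is arbitrary, so F is required to turn differences of its argument into ratios of its values (a homomorphism from addition to multiplication). The exponential does exactly that, which is why reducing the arguments to a dot product is enough. A sketch of the chain, using only the symbols already defined:

```latex
\begin{align}
F\big((w_i - w_j)^T \widetilde{w}_k\big) &= \frac{P_{ik}}{P_{jk}} \\
F\big((w_i - w_j)^T \widetilde{w}_k\big) &= \frac{F(w_i^T \widetilde{w}_k)}{F(w_j^T \widetilde{w}_k)}
  \quad \text{(required of } F \text{)} \\
F = \exp: \quad \exp\big(w_i^T \widetilde{w}_k - w_j^T \widetilde{w}_k\big)
  &= \frac{\exp(w_i^T \widetilde{w}_k)}{\exp(w_j^T \widetilde{w}_k)}
\end{align}
```

Comparing the right-hand sides of the first two lines, both are satisfied by choosing F(w_i^T w̃_k) = P_ik, and the third line confirms that exp has the required property.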

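![](http://www.forkosh.com/mathtex.cgi? F(w_i^T\widetilde{w}_k) = P_{ik} = \frac{X_{ik}}{X_i})

Taking F to be exp and taking the log of both sides gives

![](http://www.forkosh.com/mathtex.cgi? w_i^T\widetilde{w}_k = \log(P_{ik}) = \log(X_{ik})-\log(X_i))

log(X_i) does not depend on k, so it can be absorbed into a bias b_i for w_i. Adding a second bias $$\widetilde{b}_k$$ for $$\widetilde{w}_k$$ restores the symmetry between the two sets of vectors: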

![](http://www.forkosh.com/mathtex.cgi? w_i^T\widetilde{w}_k + b_i + \widetilde{b}_k = \log(X_{ik}))

The logarithm diverges when X_{ik} = 0; one fix is to replace log(X_{ik}) with log(1 + X_{ik}).

A main drawback to this model is that it weighs all co-occurrences equally, even those that happen rarely or never.

![](http://www.forkosh.com/mathtex.cgi? J = \sum_{i,j=1}^{V} f(X_{ij}) (w_i^T\widetilde{w}_j + b_i + \widetilde{b}_j - \log X_{ij})^2 )

f(X_{ij}) is the weighting function.
V is the size of the vocabulary.
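
The objective J can be written out directly as a double loop. This is a minimal pure-Python sketch for reading alongside the formula, not an efficient or official implementation: the name `glove_loss`, the argument layout (lists of lists), and the placeholder weighting function in the example are my own assumptions. Entries with X_ij = 0 are skipped, which matches giving them zero weight and avoids log(0).

```python
import math

def glove_loss(W, W_tilde, b, b_tilde, X, f):
    """J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    J = 0.0
    V = len(X)
    for i in range(V):
        for j in range(V):
            if X[i][j] == 0:
                continue  # zero counts get zero weight, so they contribute nothing
            dot = sum(wi * wj for wi, wj in zip(W[i], W_tilde[j]))
            err = dot + b[i] + b_tilde[j] - math.log(X[i][j])
            J += f(X[i][j]) * err ** 2
    return J

# Toy example with V = 2 and 2-dimensional vectors (all values made up).
X = [[0.0, 3.0],
     [3.0, 1.0]]
W = [[0.1, 0.2], [0.0, -0.1]]
W_tilde = [[0.3, 0.0], [0.1, 0.1]]
b = [0.0, 0.0]
b_tilde = [0.0, 0.0]
placeholder_weight = lambda x: min(x / 10.0, 1.0)  # stand-in for f, an assumption
print(glove_loss(W, W_tilde, b, b_tilde, X, placeholder_weight))
```

In practice the sum would be taken only over the stored nonzero entries of a sparse X rather than over all V² pairs.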

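f should satisfy three properties:

1. f(0) = 0. If f is viewed as a continuous function, it should vanish as x → 0 fast enough that the limit of f(x) log² x is finite.
2. f(x) should be non-decreasing so that rare co-occurrences are not overweighted.
3. f(x) should be relatively small for large values of x, so that frequent co-occurrences are not overweighted.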

![](http://www.forkosh.com/mathtex.cgi? f(x)=\begin{cases} (x/x_{max})^\alpha & \text{if } x<x_{max} \\ 1 & \text{otherwise} \end{cases})
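
The piecewise definition translates into a few lines of code. In this sketch the function name `weight` is my own; the defaults x_max = 100 and alpha = 3/4 are the values the paper reports using, but they are tunable hyperparameters rather than part of the definition.

```python
def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): grows as (x / x_max)^alpha, then saturates at 1."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

# The three required properties can be checked on a few sample counts.
for x in (0, 1, 10, 50, 100, 1000):
    print(x, weight(x))
```

Because alpha < 1, small counts are down-weighted less aggressively than a linear ramp would, while the cap at 1 keeps very frequent pairs from dominating the sum.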


