# [Word2Vec Series][EMNLP 2014] GloVe

https://pdfs.semanticscholar.org/b397/ed9a08ca46566aa8c35be51e6b466643e5fb.pdf

# Intro

The two main model families for learning word vectors are: 1) global matrix factorization methods, such as latent semantic analysis (LSA) (Deerwester et al., 1990), and 2) local context window methods, such as the skip-gram model of Mikolov et al. (2013c).

While methods like LSA efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.

GloVe, to some extent, combines the strengths of both approaches.

# Related

In the skip-gram and ivLBL models, the objective is to predict a word’s context given the word itself, whereas the objective in the CBOW and vLBL models is to predict a word given its context.
These are the two Word2Vec algorithms: one predicts the context from the center word, the other predicts the center word from its context.
These models suffer from the disadvantage that they do not operate directly on the co-occurrence statistics of the corpus.

# Model

- $X$: the word-word co-occurrence matrix
- $X_{ij}$: number of times word $j$ occurs in the context of word $i$
- $X_i$: number of times any word appears in the context of word $i$
- $P_{ij} = P(j|i) = X_{ij}/X_i$: probability that word $j$ appears in the context of word $i$
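The quantities above can be made concrete with a minimal sketch. The toy corpus, the `window` size, and the function name `cooccurrence` are all illustrative assumptions, not from the paper:

```python
from collections import defaultdict

def cooccurrence(corpus, window=2):
    """Count word-word co-occurrences X[i][j] within a symmetric window."""
    X = defaultdict(lambda: defaultdict(float))
    for sentence in corpus:
        for pos, word in enumerate(sentence):
            lo = max(0, pos - window)
            hi = min(len(sentence), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    X[word][sentence[ctx_pos]] += 1.0
    return X

corpus = [["ice", "is", "solid"], ["steam", "is", "gas"]]  # toy corpus (assumption)
X = cooccurrence(corpus)
Xi = sum(X["ice"].values())           # X_i: total context count for "ice"
P_solid_ice = X["ice"]["solid"] / Xi  # P(solid | ice) = X_ij / X_i
```

The full GloVe pipeline additionally applies a distance-based decay within the window; this sketch uses uniform counts for clarity.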

Compared to the raw probabilities, the ratio is better able to distinguish relevant words from irrelevant words and it is also better able to discriminate between the two relevant words.

solid is related to ice but not to steam, so the ratio $P_{ik}/P_{jk}$ is large in that case.

$$F(w_i, w_j, \widetilde{w}_k)=\frac{P_{ik}}{P_{jk}}$$

Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences

$$F(w_i - w_j, \widetilde{w}_k)=\frac{P_{ik}}{P_{jk}}$$
While F could be taken to be a complicated function parameterized by, e.g., a neural network, doing so would obfuscate the linear structure we are trying to capture. To avoid this issue, we can first take the dot product of the arguments:

$$F\left((w_i - w_j)^T\widetilde{w}_k\right)=\frac{P_{ik}}{P_{jk}}$$

<font color=gray>I don't quite understand the reasoning here</font>

<font color=gray>I don't quite understand why this step is sufficient</font>
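The step the paper uses here: it requires $F$ to be a homomorphism between the additive group $(\mathbb{R}, +)$ and the multiplicative group $(\mathbb{R}_{>0}, \times)$, i.e.

$$F\left((w_i - w_j)^T\widetilde{w}_k\right)=\frac{F(w_i^T\widetilde{w}_k)}{F(w_j^T\widetilde{w}_k)}$$

Since $\exp(a-b)=\exp(a)/\exp(b)$, the solution is $F=\exp$, and matching the numerator and denominator against $P_{ik}/P_{jk}$ gives $F(w_i^T\widetilde{w}_k)=P_{ik}$.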

$$F(w_i^T\widetilde{w}_k)=P_{ik}=\frac{X_{ik}}{X_i}$$
F is the exponential function, so taking the log of both sides gives:
$$w_i^T\widetilde{w}_k = \log(P_{ik}) = \log(X_{ik})-\log(X_i)$$
$\log(X_i)$ is independent of $k$, so it can be absorbed into a bias term $b_i$ for $w_i$.

$$w_i^T\widetilde{w}_k + b_i + \widetilde{b}_k = \log(X_{ik})$$
The log can run into $\log(0)$ when a pair never co-occurs; $\log(1+X_{ik})$ can be used in place of $\log(X_{ik})$.

A main drawback to this model is that it weighs all co-occurrences equally, even those that happen rarely or never.

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^T\widetilde{w}_j + b_i + \widetilde{b}_j - \log X_{ij}\right)^2$$

$f(X_{ij})$ is the weighting function and $V$ is the vocabulary size.

f must satisfy three properties:

1. $f(0) = 0$, or its limit as $x \to 0$ is 0.
2. $f(x)$ should be non-decreasing so that rare co-occurrences are not overweighted.
3. $f(x)$ should be relatively small for large values of $x$, so that frequent co-occurrences are not overweighted.

$$f(x)=\begin{cases} (x/x_{max})^\alpha & x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$
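The weighting function and the objective $J$ can be sketched as follows. The values $x_{max}=100$ and $\alpha=3/4$ are the ones reported in the paper; the function names and the dense-matrix layout are illustrative assumptions (real implementations iterate over nonzero entries only):

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: (x/x_max)^alpha below x_max, 1 otherwise; f(0)=0."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares objective J over observed co-occurrence counts."""
    mask = X > 0                               # only observed pairs contribute
    logX = np.log(np.where(mask, X, 1.0))      # dummy 1.0 where X=0; masked below
    pred = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    return np.sum(f(X) * mask * (pred - logX) ** 2)
```

Since $f(0)=0$, zero-count pairs drop out of the sum, which is what makes the model trainable on the sparse nonzero entries of $X$ alone.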

