[論文速速讀]Efficient Estimation of Word Representation in Vector Space

〖想觀看更多中文論文導讀，至[論文速速讀]系列文章介紹可以看到目前已發布的所有文章！〗

這篇就是鼎鼎大名的word2vec，在2013年由google提出。

Word2vec是一種word embedding技術，為什麼需要word embedding? 因為這樣可以將NLP task的word轉成一個numeric vector，然後加以進行分析和運算。

paper: Efficient Estimation of Word Representation in Vector Space

Abstract

We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

word2vec的paper，提出了Skip-gram跟CBOW，相較於傳統的方法，能夠大幅提高訓練速度。

Introduction

Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity between words, as these are represented as indices in a vocabulary. This choice has several good reasons - simplicity, robustness and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data.

傳統的一些nlp task把word視作不可分割的單元(atomic units)，詞和詞之間不具相似度，因為他們都只是vocab table的index。這種方法簡單且robust，n-gram就是其中的一個應用。

However, the simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain data for automatic speech recognition is limited - the performance is usually dominated by the size of high quality transcribed speech data (often just millions of words). In machine translation, the existing corpora for many languages contain only a few billions of words or less. Thus, there are situations where simple scaling up of the basic techniques will not result in any significant progress, and we have to focus on more advanced techniques.

但是這種方法也有他的極限，當資料詞庫過少的時候表現就會不好。

這篇paper的目標:

可從大量資料(數十億level)中學習word vectors的技術
- 據目前所知，還沒有方法提出可以在幾百萬的資料上學習50-100的詞向量
希望學習到相似的詞距離相近，而且詞有不同的相似度(multiple degrees of similarity)
- 名詞的多個詞尾可能在相近的空間中被找到
新的模型可以做到向量的線性操作
- vector(”King”) - vector(”Man”) + vector(”Woman”) results in a vector that is closest to the vector representation of the word Queen

接下來介紹NNLM(Neural Network Language Model): 一個線性投影層 + 一個非線性隱藏層，用來學習詞向量表示和統計語言模型

… Another interesting architecture of NNLM was presented in [13, 14], where the word vectors are first learned using neural network with a single hidden layer. The word vectors are then used to train the NNLM. Thus, the word vectors are learned even without constructing the full NNLM.

提到說NNLM先透過single hidden layer來訓練word vectors，然後在使用word vectors來train NNLM，所以即使NNLM沒有train完也可以train好word vectors。

word2vec就是基於這個想法的擴展應用。

Model Architectures

為了比較模型好壞，先定義接下來訓練深度模型的複雜度皆為: $O = E\times T\times Q$

E: 迭代次數
T: 訓練集的詞個數
Q: 模型參數
Common choice is E = 3 − 50 and T up to one billion

使用mini-batch + Adagrad + SGD來訓練

Feedforward Neural Net Language Model (NNLM)

(上圖是原始paper(A Neural Probabilistic Language Model)的圖，annotation可能會跟下面word2vec的對不起來，下面使用原本word2vec的annotation)

有4層: input, projection, hidden, output

input layer: 前N words使用one-hot encoding成V維的向量，V是(vocab size)
- 注意這裡是用前N words，而不是用所有words來訓練，這是和word2vec中CBOW投影層的差異!!
projection layer: input(NxV)會使用同一個projection matrix(VxD)投影到projection layer P(NxD)
- D是投影後的維度
- 共用一個projection matrix，所以這裡的cost還算低
hidden layer: 隱藏層來計算整個word的機率，有H個neuron
output layer有V個neuron

所以整體的模型參數量是 $Q=N×D+N×D×H+H×V$

其中output layer的HxV最重要
- 有一些優化的方法，例如hierarchical softmax，使用binary tree representations of the vocabulary(Huffman tree)，可以降到$\log_2(V)$
所以其實主要的計算量在hidden layer

Recurrent Neural Net Language Model (RNNLM)

原始paper: Linguistic Regularities in Continuous Space Word Representations

只有input, hidden, output層，訓練複雜度是 $Q=D×H+H×V$

D和隱藏層H有相同的維度
- 使用hierarchical softmax + huffman tree，H×V可以降低為H×$\log_2V$，所以大部分的複雜度來自D×H

Parallel Training of Neural Networks

使用DistBelief框架，該框架可以使用平行計算

New Log-linear Models

提出了兩個新模型，之前的研究表示大部分的複雜度是由於模型的非線性層導致的，所以這裡試圖減少非線性層的參數量

Continuous Bag-of-Words Model

和NNLM相似，但刪除了非線性層，並且投影層是所有的words共用(NNLM是前N words共用)

所有的单词都投影到同一个位置（所有向量取平均值）。这样不考虑单词的位置顺序信息，叫做词袋模型
- 詞的順序對於不影響投影
会用到将来的词，例如如果窗口 windows 为 2，这样训练中心词的词向量时，会选取中心词附近的 4 个上下文词（前面 2 个后面 2 个）
输出层是一个 log-linear 分类器

整體的模型參數量為: $Q=N×D+D×log(V)$

log(V)是用到了hierarchical softmax + huffman tree

Continuous Skip-gram Model

跟CBOW相似，不過是根據中心的詞去預測上下文

通過實驗發現，windows越大效果越好(但cost也越大)
距離較遠的word通常關聯性較小，所以透過抽取較少的樣本(降低機率)來降低對距離較遠word的權重

整體複雜度: $Q=C×(D+D×log(V))$

C是window size的2倍，也就是要預測的word個數
- 也就是說預測每個word所需的參數量是D+D×log(V)
- 看到logV就知道用到了hierarchical softmax

Results

We follow previous observation that there can be many different types of similarities between words, for example, word big is similar to bigger in the same sense that small is similar to smaller. Example of another type of relationship can be word pairs big - biggest and small - smallest [20]. We further denote two pairs of words with the same relationship as a question, as we can ask: ”What is the word that is similar to small in the same sense as biggest is similar to big?”

單詞間具有相似性，並且可透過向量加減法來得到語意上的正確結果

延伸討論: Word2vec implement detail

下面講的可能不在原始paper內，不過是實作上可能會被提及的技術。根據一些網站資源學習了之後加以整理:

Hidden layer沒有activation function

在input-hidden層，沒有非線性變換，而是簡單地把所有vector加總並取平均，以減少計算複雜度

Hierarchical Softmax v.s. Negative sampling

實際上，input layer有CBOW和Skip-gram兩種版本，output也有Hierarchical Softmax和Negative sampling兩種版本

Hierarchical Softmax

一般正常input-hidden-output的model如果套用在embedding training，因為output層的softmax計算量很大(要去算所有詞的softmax機率，再去找機率最大值)

透過huffman tree + Hierarchical Softmax，tree的根節點是每一個word，透過一步步走到leaf來求得softmax的值
- 從root到leaf只需要log(V)步
如此可以大幅的加快求得softmax的速度
缺點: 但是如果我们的训练样本里的中心词w是一个很生僻的词，那么就得在霍夫曼树中辛苦的向下走很久了
- 於是有了Negative Sampling

Negative sampling

定義window內的為正樣本，window外的為負樣本，如此就可以不用把全部的word拿進來一起train

只需要window內的所有正樣本 + sampling一定數量的負樣本就足夠訓練模型

透過Unigram distribution來模擬負樣本(不在window內的word)被選中的機率

設計這個分佈時希望詞被抽到的機率要跟這個詞出現的頻率有關，出現在文本中的頻率越高越高越有可能被抽到

公式為: $P(w_i) = \frac{ {f(w_i)}^{3/4} }{\sum_{j=0}^{n}\left( {f(w_j)}^{3/4} \right ) }$
$f(w_i)$代表$w_i$出現次數(頻率)，3/4是實驗try出來的數據

例如：有一個詞編號是 100，它出現在整個文本中 1000 次，所以 100 在 unigram table 就會出現 $1000^{0.75} = 177$ 次

至於要選幾個詞當 negative sample，paper 中建議如下

Our experiments indicate that values of k in the range 5–20 are useful for small training datasets, while for large datasets the k can be as small as 2–5.

Subsampling of frequent words

英文中 “the”, “a”, “in”，中文中的「的」、「是」等等這種詞，其實在句子中並沒有辦法提供太多資訊但又常常出現，對訓練沒有太大幫助，所以就用一個機率來決定這個詞是否要被丟掉，公式如下 $P(f_i) = (\sqrt{\frac{f(w_i)}{0.001}} + 1) \cdot \frac{0.001}{f(w_i)}$

結論

word2vec真滴神，然後skip-gram似乎普遍比CBOW好用

現在有很多方法可以直接使用word2vec，如gensim，不過了解了背後的由來相信能夠對於word2vec更有感覺，以及更清楚應該如何調整相關參數。