[論文速速讀]NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE

Posted by John on 2020-05-06
Words 1.5k and Reading Time 6 Minutes

〖For more Chinese-language paper walkthroughs, head over to the [論文速速讀] series introduction to see all the articles published so far!〗

Preface

Everyone knows how hot attention is in NLP. This is said to be the first paper to bring the attention concept into NLP, so come pay your respects to the one that puts food on our table.

paper: NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE

The authors' slides: Neural Machine Translation by Jointly Learning to Align and Translate

ABSTRACT

In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

Neural network approaches to translation have traditionally been built on the encoder-decoder architecture, where the fixed-length context vector becomes a performance bottleneck. This paper therefore proposes a dynamic context vector: the model itself (soft-)searches for the parts of the input relevant to the current prediction, which pushes performance way up.

This is also the earliest paper to propose the attention mechanism.

INTRODUCTION

The paper first introduces neural machine translation and then the encoder-decoder architecture.

It then points out the drawback of the encoder-decoder: the context vector must carry enough information about the entire input sentence, otherwise quality suffers, and this problem surfaces as sentences get longer.

Next they explain why their own method is awesome:

In order to address this issue, we introduce an extension to the encoder–decoder model which learns to align and translate jointly. Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.

The model automatically locates the parts of the source sentence that carry the most information for the current step, and predicts the target word from the context vector built from those positions.

  • A literal translation is a bit hard to parse. It's simply the attention mechanism: look at which part matters most right now!

The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.

If you're not yet familiar with attention, read this passage. Compared with the original encoder, which squashes the whole input sequence into one fixed context vector, the attention mechanism selects a subset of the encoded vectors and dynamically builds the context vector from that region, so the context vector can vary from step to step.
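A tiny made-up example of that idea (minimal NumPy, not the paper's code): the encoder annotations stay fixed, but a different set of weights at each decoding step gives a different context vector.

```python
import numpy as np

# Encoder annotations: one vector per source token (4 tokens, dim 3); values made up.
H = np.array([[0.1, 0.3, 0.5],
              [0.7, 0.2, 0.0],
              [0.4, 0.9, 0.1],
              [0.6, 0.5, 0.8]])

# Attention weights for two different decoding steps (each sums to 1).
alpha_step1 = np.array([0.70, 0.10, 0.10, 0.10])  # mostly "looks at" token 0
alpha_step2 = np.array([0.05, 0.05, 0.80, 0.10])  # mostly "looks at" token 2

# The context vector is the weighted sum of the annotations,
# so it changes from step to step instead of being one fixed vector.
c1 = alpha_step1 @ H
c2 = alpha_step2 @ H
print(c1, c2)
```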


BACKGROUND: NEURAL MACHINE TRANSLATION

RNN ENCODER–DECODER

An RNN-based encoder-decoder. If you're interested, go read the paper yourself; this part is simple enough that I don't feel like typing out a pile of notation… QQ
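For reference, and compressing the paper's Section 2 notation quite a bit, the basic encoder-decoder it reviews boils down to this: an encoder RNN produces hidden states $h_t$, they get squashed into one fixed-length context vector $c$, and the decoder conditions every output word on that same $c$:

$$h_t = f(x_t, h_{t-1}), \qquad c = q(\{h_1, \dots, h_{T_x}\})$$

$$p(\mathbf{y}) = \prod_{t=1}^{T_y} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c), \qquad p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$$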

LEARNING TO ALIGN AND TRANSLATE

DECODER: GENERAL DESCRIPTION

$s_i$ is the decoder RNN's hidden state at time $i$, computed from the previous hidden state $s_{i-1}$, the previous output $y_{i-1}$, and the context vector $c_i$: $s_i = f(s_{i-1}, y_{i-1}, c_i)$.

In other words, the context vector is different at every time step. It is obtained from the following formula:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

$h$ denotes the annotations obtained by feeding the input sequence through the RNN; since the RNN here is bidirectional, each annotation vector is twice the hidden size. The weights $\alpha$ are computed with the following softmax:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$

$e_{ij} = a(s_{i-1}, h_j)$ is an alignment model: a score that reflects how important the inputs around position $j$ are for the $i$-th output.

  • Note that in this paper this term is learned; in later work, attention does not necessarily need trained parameters
  • The alignment model is trained jointly with the rest of the translation task
  • The details are given in the appendix at the end

Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed length vector.
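Putting the three formulas above together, here's a minimal NumPy sketch of a single decoding step. Everything is toy-sized and randomly initialized, purely for illustration: the real model uses a gated hidden unit and learns all these weights jointly, and the additive form of $a(\cdot)$ used here is the one given in the paper's appendix.

```python
import numpy as np

rng = np.random.default_rng(0)
Tx, hdim, sdim, adim, ydim = 5, 6, 4, 3, 8    # toy sizes

H = rng.standard_normal((Tx, hdim))           # encoder annotations h_1 .. h_Tx
s_prev = rng.standard_normal(sdim)            # previous decoder state s_{i-1}
y_prev = rng.standard_normal(ydim)            # embedding of previous output y_{i-1}

# Alignment model a(s_{i-1}, h_j): a small feed-forward net (additive form, see appendix).
W_a = rng.standard_normal((adim, sdim))
U_a = rng.standard_normal((adim, hdim))
v_a = rng.standard_normal(adim)
e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])  # e_{ij} for every j

alpha = np.exp(e) / np.exp(e).sum()           # softmax over source positions
c_i = alpha @ H                               # context vector c_i = sum_j alpha_{ij} h_j

# State update s_i = f(s_{i-1}, y_{i-1}, c_i); a plain tanh cell stands in for f here.
W_s = rng.standard_normal((sdim, sdim))
W_y = rng.standard_normal((sdim, ydim))
W_c = rng.standard_normal((sdim, hdim))
s_i = np.tanh(W_s @ s_prev + W_y @ y_prev + W_c @ c_i)

print(alpha.round(3), c_i.shape, s_i.shape)
```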

ENCODER: BIDIRECTIONAL RNN FOR ANNOTATING SEQUENCES

The encoder uses a bidirectional RNN. OK, this part probably isn't the main point xD

So the whole thing, put together, becomes the architecture shown in the figure below (Figure 1 in the paper).

The encoder is the blue and red BiRNN, producing $h_i= [ \overrightarrow{h_i} ; \overleftarrow{h_i} ]$; what the decoder then does is exactly what was described above.
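And a minimal NumPy sketch of how those annotations can be formed: run a simple RNN forward and backward over the embedded input and concatenate the two hidden states at each position. A plain tanh cell stands in for the paper's gated unit, and the forward/backward weights are shared here only to keep the snippet short (the paper uses two separate RNNs).

```python
import numpy as np

rng = np.random.default_rng(0)
Tx, xdim, n = 5, 4, 3                        # sequence length, embedding dim, hidden size
X = rng.standard_normal((Tx, xdim))          # embedded source tokens x_1 .. x_Tx

W_x = rng.standard_normal((n, xdim))
W_h = rng.standard_normal((n, n))

def run_rnn(inputs):
    """Simple tanh RNN; returns the hidden state at every position."""
    h, states = np.zeros(n), []
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return states

fwd = run_rnn(X)                             # forward pass over x_1 .. x_Tx
bwd = run_rnn(X[::-1])[::-1]                 # backward pass, re-aligned to positions

# Annotation h_i = [forward_i ; backward_i], so each one has size 2n.
H = np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])
print(H.shape)                               # (Tx, 2n)
```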

EXPERIMENT SETTINGS

QUALITATIVE ANALYSIS

ALIGNMENT

It compares the results with the attention mechanism (left) against the original encoder-decoder (right); I'll skip the part where the numbers look awesome.

Let's talk about the more interesting attention analysis instead: after training their model, the authors plotted the corresponding $\alpha_{ij}$ values.

In the plotted alignments (see the figure in the paper) you can see a very distinctive anti-diagonal streak in the middle; the authors point out that because English and French order adjectives and nouns differently,

[European Economic Area]

becomes

[zone économique européenne]

Even so, attention is still awesome enough to know that Area corresponds to zone!
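If you want to draw this kind of alignment plot for your own model, a matplotlib sketch along these lines will do. The tokens and $\alpha$ values below are invented just to mimic the reversed-order pattern; they are not the paper's numbers.

```python
import numpy as np
import matplotlib.pyplot as plt

src = ["the", "European", "Economic", "Area"]        # source tokens (toy example)
tgt = ["la", "zone", "économique", "européenne"]     # target tokens

# Fake alpha matrix: rows = target words, columns = source words, each row sums to 1.
alpha = np.array([[0.85, 0.05, 0.05, 0.05],
                  [0.05, 0.05, 0.10, 0.80],
                  [0.05, 0.10, 0.80, 0.05],
                  [0.05, 0.80, 0.10, 0.05]])

plt.imshow(alpha, cmap="gray_r")                     # darker cell = larger weight
plt.xticks(range(len(src)), src, rotation=45)
plt.yticks(range(len(tgt)), tgt)
plt.xlabel("source")
plt.ylabel("target")
plt.tight_layout()
plt.show()
```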

The paper also discusses the pros and cons of soft-alignment (attention) versus hard-alignment (aligning positions one by one in order):

  • It's more flexible
  • It allows many-to-many alignments
  • Soft-alignment allows the input and output to have different lengths
    • With hard-alignment, differing lengths require adding extra tokens such as [NULL], but that's something other papers deal with

LONG SENTENCES

LEARNING TO ALIGN

…Our approach, on the other hand, requires computing the annotation weight of every word in the source sentence for each word in the translation. This drawback is not severe with the task of translation in which most of input and output sentences are only 15–40 words. However, this may limit the applicability of the proposed scheme to other tasks.

Because attention computes a weight over every token of the input sequence for each output word, this is fine for machine translation, where sentences are mostly 15-40 words long, but once sequences get much longer it could become a real computational cost.

CONCLUSION

Oh, one more thing: attention is the famous part, but the model proposed in this paper is actually called RNNsearch!
I forgot to mention that earlier, so consider this a catch-up note.

Appendix

Don't assume the appendix is unimportant. Let's look directly at how that magical alignment model, i.e. how $e$ is actually designed:

$$e_{ij} = a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$$

p.s. Note that here $s_{i-1}$ and $h_j$ are projected and then added together; as you go on to read other attention papers you'll see plenty of other approaches (add, concat…)
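As a rough cheat sheet for that p.s. (my own summary of later work, e.g. Luong et al. 2015, not something this paper covers), the scoring functions you'll commonly run into look like:

$$\text{additive (this paper): } v_a^\top \tanh(W_a s_{i-1} + U_a h_j) \qquad \text{dot-product: } s_{i-1}^\top h_j \qquad \text{bilinear ("general"): } s_{i-1}^\top W h_j$$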

