# [論文速速讀]NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE

Posted by John on 2020-05-06
Words 1.5k and Reading Time 6 Minutes

〖To read more Chinese paper walkthroughs, see the [論文速速讀] series introduction, which lists every article published so far!〗

## Preface

As everyone knows, attention is huge in NLP. This is said to be the first paper to bring the attention concept into NLP, so come pay your respects to the benefactor we all owe our livelihood to.

## ABSTRACT

In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

## INTRODUCTION

In order to address this issue, we introduce an extension to the encoder–decoder model which learns to align and translate jointly. Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.

• A literal translation is a bit hard to parse; it is simply the attention mechanism: look at which part of the source is most important right now!

The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.

## BACKGROUND: NEURAL MACHINE TRANSLATION

### RNN ENCODER–DECODER

An RNN-based encoder-decoder; if you're interested, go read the paper yourself. This part is fairly simple and I don't feel like typing out a pile of notation… QQ
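As a quick reminder of what the fixed-length bottleneck looks like, here is a minimal NumPy sketch of a plain (non-attention) encoder; the toy dimensions and the bare tanh RNN are my own simplification, not the paper's actual GRU-based model:

```python
import numpy as np

def rnn_step(x, h, W, U, b):
    # One step of a plain tanh RNN (a stand-in for the paper's GRU)
    return np.tanh(W @ x + U @ h + b)

rng = np.random.default_rng(0)
n_in, n_hid = 8, 4                     # toy dimensions
src = rng.normal(size=(6, n_in))       # a 6-token source "sentence"
W = rng.normal(size=(n_hid, n_in))
U = rng.normal(size=(n_hid, n_hid))
b = np.zeros(n_hid)

# Encoder: the entire sentence gets squashed into the final hidden state
h = np.zeros(n_hid)
for x in src:
    h = rnn_step(x, h, W, U, b)
c = h  # the single fixed-length vector every target word must be decoded from
```

Every target word is then decoded from that one vector `c`, no matter how long the source is, which is exactly the bottleneck the abstract complains about.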

## LEARNING TO ALIGN AND TRANSLATE

### DECODER: GENERAL DESCRIPTION

$s_i$ is the RNN hidden state at time $i$; it is computed from the previous state $s_{i-1}$, the previous output $y_{i-1}$, and the context vector $c_i$: $s_i = f(s_{i-1}, y_{i-1}, c_i)$.

$h_j$ is the output of running the input sequence through the RNN; the RNN here is bidirectional, so the output vectors are twice the hidden size. The weights $\alpha$ are computed by the following formula:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})}, \qquad c_i = \sum_{j=1}^{T_x}\alpha_{ij} h_j$$

$e_{ij} = a(s_{i-1}, h_j)$ is an alignment model: a score recording how important the input around position $j$ is for the $i$-th output.

• Note that in this paper this term is trained (jointly with the rest of the model); in later work the attention mechanism does not necessarily need to be trained
• The details are given in the appendix at the end

Intuitively, this implements a mechanism of attention in the decoder. The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed length vector.
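To make the equations concrete, here is a minimal NumPy sketch of one decoder step. The weight names $W_a$, $U_a$, $v_a$ follow the alignment model in the paper's appendix, while the toy dimensions and random values are purely illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_enc, n_dec, n_align, T_x = 4, 3, 5, 6   # toy sizes; n_enc = 2 * BiRNN hidden size

W_a = rng.normal(size=(n_align, n_dec))   # projects the decoder state s_{i-1}
U_a = rng.normal(size=(n_align, n_enc))   # projects each annotation h_j
v_a = rng.normal(size=(n_align,))

h = rng.normal(size=(T_x, n_enc))         # encoder annotations h_1 .. h_{T_x}
s_prev = rng.normal(size=(n_dec,))        # decoder hidden state s_{i-1}

# Alignment model: e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), trained jointly
e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in h])
alpha = softmax(e)                        # attention weights alpha_ij, sum to 1
c_i = alpha @ h                           # context vector c_i = sum_j alpha_ij h_j
```

Each set of weights $\alpha_{i\cdot}$ is a probability distribution over source positions, so $c_i$ is just a weighted average of the annotations under that distribution.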

### ENCODER: BIDIRECTIONAL RNN FOR ANNOTATING SEQUENCES

The encoder uses a bidirectional RNN. OK, this probably isn't the main point here xD

The encoder is just the BiRNN (the blue and red parts of the paper's model figure), which produces $h_i= [ \overrightarrow{h_i} ; \overleftarrow{h_i} ]$; what the decoder does with these annotations is exactly as described above.
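A minimal sketch of the concatenation, again using a plain tanh RNN instead of the paper's GRU and made-up dimensions:

```python
import numpy as np

def rnn(xs, W, U, b):
    # Plain tanh RNN over a whole sequence; returns every hidden state
    h, out = np.zeros(U.shape[0]), []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        out.append(h)
    return np.stack(out)

rng = np.random.default_rng(0)
T_x, n_in, n_hid = 6, 8, 4
xs = rng.normal(size=(T_x, n_in))              # embedded source tokens

fwd = (rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid)), np.zeros(n_hid))
bwd = (rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid)), np.zeros(n_hid))

h_fwd = rnn(xs, *fwd)                          # reads x_1 .. x_{T_x}
h_bwd = rnn(xs[::-1], *bwd)[::-1]              # reads x_{T_x} .. x_1, then re-aligned
h = np.concatenate([h_fwd, h_bwd], axis=1)     # h_j = [h_fwd_j ; h_bwd_j], 2 * n_hid wide
```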

## EXPERIMENT SETTINGS

### QUALITATIVE ANALYSIS

#### ALIGNMENT

The paper's example: [European Economic Area] is translated as [zone économique européenne], with the word order reversed, and the soft-alignment handles this reordering naturally.

• More flexible
• Allows many-to-many alignment
• Soft-alignment lets the input/output lengths differ
• Hard-alignment needs extra tokens when the lengths differ, e.g. [NULL], but that is what other papers do (a tiny illustration of soft vs. hard weights follows this list)
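Purely to illustrate the contrast, a sketch with made-up weights (not the paper's actual values) for the example above:

```python
import numpy as np

# Made-up soft-alignment weights for translating
# [zone, économique, européenne] from [European, Economic, Area]:
# each target word spreads its weight over several source words.
alpha = np.array([
    [0.05, 0.10, 0.85],   # "zone"        <- mostly "Area"
    [0.10, 0.80, 0.10],   # "économique"  <- mostly "Economic"
    [0.85, 0.10, 0.05],   # "européenne"  <- mostly "European"
])
assert np.allclose(alpha.sum(axis=1), 1.0)   # each row is a softmax output

# A hard alignment would force a single 0/1 choice per target word instead:
hard = (alpha == alpha.max(axis=1, keepdims=True)).astype(float)
```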

### LEARNING TO ALIGN

…Our approach, on the other hand, requires computing the annotation weight of every word in the source sentence for each word in the translation. This drawback is not severe with the task of translation in which most of input and output sentences are only 15–40 words. However, this may limit the applicability of the proposed scheme to other tasks.

## Appendix

p.s. Note that here $s_{i-1}$ and $h_j$ are added together: the appendix defines the alignment model as $a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$. As you go on to read other attention papers you will see many other scoring variants (add, concat…)
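A tiny sketch contrasting the scoring variants mentioned above; the parameter names and shapes are made up for illustration, and only the additive form is the one this paper actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
s, h = rng.normal(size=n), rng.normal(size=n)              # decoder state, one annotation
W_a, U_a, v_a = rng.normal(size=(n, n)), rng.normal(size=(n, n)), rng.normal(size=n)
W_c, v_c = rng.normal(size=(n, 2 * n)), rng.normal(size=n)

score_add    = v_a @ np.tanh(W_a @ s + U_a @ h)             # this paper: add the two projections
score_concat = v_c @ np.tanh(W_c @ np.concatenate([s, h]))  # "concat"-style scoring seen in later work
score_dot    = s @ h                                        # parameter-free dot-product scoring
```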
