[Paper Speed-Read] Hierarchical Attention Networks for Document Classification

〖For more paper walkthroughs in Chinese, see the [Paper Speed-Read] series index, which lists every post published so far.〗

Paper link: https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf

Abstract

We propose a hierarchical attention network for document classification. Our model has two distinctive characteristics:
(i) it has a hierarchical structure that mirrors the hierarchical structure of documents;
(ii) it has two levels of attention mechanisms applied at the word and sentence-level, enabling it to attend differentially to more and less important content when constructing the document representation.

The paper proposes a model built on a hierarchical attention architecture for document classification. The abstract highlights two features:

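It uses a hierarchical structure that mirrors the structure of a document
It applies an attention mechanism at both the word level and the sentence level, so that the more important content gets more weight when the representation is built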

Introduction

Although neural-network–based approaches to text classification have been quite effective (Kim, 2014; Zhang et al., 2015; Johnson and Zhang, 2014; Tang et al., 2015), in this paper we test the hypothesis that better representations can be obtained by incorporating knowledge of document structure in the model architecture. The intuition underlying our model is that not all parts of a document are equally relevant for answering a query and that determining the relevant sections involves modeling the interactions of the words, not just their presence in isolation

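Although several neural approaches already do well on text classification, this paper tests a hypothesis: building knowledge of document structure into the model architecture yields better document representations.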

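The intuition behind the design: not every part of a document is equally relevant to the task, and finding the relevant parts requires modeling how words interact rather than treating each word as an isolated signal.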

First, since documents have a hierarchical structure (words form sentences, sentences form a document), we likewise construct a document representation by first building representations of sentences and then aggregating those into a document representation. Second, it is observed that different words and sentences in a documents are differentially informative. Moreover, the importance of words and sentences are highly context dependent.

A document can be viewed as a two-level structure: "sentences made of words" and "a document made of sentences". HAN first builds sentence-level representations, then aggregates them into a document representation.

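The paper also observes that:

Different words and sentences in a document carry different amounts of information
The importance of a word or sentence depends heavily on its context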

Hierarchical Attention Networks

HAN consists of the following parts:

word sequence encoder
word-level attention layer
sentence encoder
sentence-level attention layer

GRU-based sequence encoder

First, a quick review of the GRU.

The GRU's new state:

$h_t=(1-z_t)\odot h_{t-1}+z_t\odot \tilde h_t$

where the candidate state $\tilde h_t$ is computed as:

$\tilde h_t= tanh(W_h x_t + r_t \odot (U_h h_{t-1})+b_h)$

Unlike the LSTM, the GRU uses only two gates, a reset gate $r$ and an update gate $z$, which are computed as follows:

$z_t = \sigma(W_z x_t+U_z h_{t-1}+b_z) \\ r_t = \sigma(W_r x_t+U_r h_{t-1}+b_r)$
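To make the three GRU equations concrete, here is a minimal NumPy sketch of a single step. The parameter dictionary, the `init_gru_params` helper, the random initialization and the toy dimensions are all assumptions made for illustration; they are not from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step following the equations above; p is a dict of parameters."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])   # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])   # reset gate
    # Candidate state: the reset gate scales how much of the past state is used.
    h_cand = np.tanh(p["W_h"] @ x_t + r * (p["U_h"] @ h_prev) + p["b_h"])
    # New state: interpolate between the previous state and the candidate.
    return (1.0 - z) * h_prev + z * h_cand

def init_gru_params(input_dim, hidden_dim, seed=0):
    """Hypothetical helper: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in ("z", "r", "h"):
        p[f"W_{g}"] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
        p[f"U_{g}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
        p[f"b_{g}"] = np.zeros(hidden_dim)
    return p

params = init_gru_params(input_dim=4, hidden_dim=3)
h = np.zeros(3)
for x_t in np.random.default_rng(1).normal(size=(5, 4)):  # a toy sequence of 5 inputs
    h = gru_step(x_t, h, params)
print(h.shape)  # (3,)
```

When $z_t$ is close to 0 the state is carried over almost unchanged, which is how the GRU keeps information across many steps.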

Hierarchical Attention

Word Encoder

Each word is mapped to a word embedding $x_{it}$, and a bidirectional GRU then produces the encoded vectors:

$x_{it}=W_e w_{it}, t\in[1, T] \\ \overrightarrow h_{it}=\overrightarrow{GRU}(x_{it}), t\in[1, T]\\ \overleftarrow h_{it}=\overleftarrow{GRU}(x_{it}), t\in[T, 1]$
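The word encoder can be sketched as an embedding lookup followed by one forward pass and one backward pass whose states are concatenated per word. To keep the example short, the recurrent cell is a plain tanh RNN used as a stand-in (the paper uses a GRU here), and all helper names, dimensions and weights are assumptions for illustration.

```python
import numpy as np

def run_direction(cell, xs, hidden_dim):
    """Run a recurrent cell over xs in the given order, returning one state per step."""
    h = np.zeros(hidden_dim)
    states = []
    for x_t in xs:
        h = cell(x_t, h)
        states.append(h)
    return np.stack(states)

def bidirectional_encode(fwd_cell, bwd_cell, xs, hidden_dim):
    """Concatenate forward and backward states so row t summarizes the context around word t."""
    fwd = run_direction(fwd_cell, xs, hidden_dim)              # reads t = 1..T
    bwd = run_direction(bwd_cell, xs[::-1], hidden_dim)[::-1]  # reads t = T..1, then re-aligned
    return np.concatenate([fwd, bwd], axis=1)

def make_tanh_cell(input_dim, hidden_dim, seed):
    """Stand-in cell (plain tanh RNN), NOT the GRU used in the paper."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
    U = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    return lambda x_t, h: np.tanh(W @ x_t + U @ h)

T, vocab_size, embed_dim, hidden_dim = 6, 20, 8, 5
rng = np.random.default_rng(0)
W_e = rng.normal(size=(vocab_size, embed_dim))   # embedding table, one row per word id
word_ids = rng.integers(0, vocab_size, size=T)   # a toy sentence given as word ids
x = W_e[word_ids]                                # row lookup, same effect as multiplying by a one-hot w_it
h = bidirectional_encode(make_tanh_cell(embed_dim, hidden_dim, 1),
                         make_tanh_cell(embed_dim, hidden_dim, 2),
                         x, hidden_dim)
print(h.shape)  # (6, 10)
```

Each row of `h` is twice the hidden size because it stacks the forward and backward states for the same word.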

Word Attention

Apply one transformation to $h_{it}=[\overrightarrow{h_{it}}, \overleftarrow{h_{it}}]$, then compute attention against a word-level context vector $u_w$. Here $u_w$ can be seen as "a high level representation of a fixed query 'what is the informative word' over the words like that used in memory networks"; it is randomly initialized and learned jointly with the rest of the model.

$u_{it}=tanh(W_w h_{it}+b_w) \\ \alpha_{it} = \frac{exp(u_{it}^{T}u_w)}{\sum_t exp(u_{it}^{T}u_w)} \\ s_i=\sum_t\alpha_{it}h_{it}$

The resulting $s_i$ is the sentence vector.
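The attention step for one sentence can be sketched in NumPy as follows. The names `W_w`, `b_w` and `u_w` follow the paper's notation for the word-level parameters; the random values and the sentence length are assumptions for illustration, and subtracting the maximum score is a standard numerical-stability trick that leaves the softmax unchanged.

```python
import numpy as np

def word_attention(H, W_w, b_w, u_w):
    """H: (T, d) encoded words. Returns the sentence vector s and the weights alpha."""
    U = np.tanh(H @ W_w.T + b_w)          # u_it for every word, shape (T, a)
    scores = U @ u_w                      # similarity with the context vector, shape (T,)
    scores = scores - scores.max()        # stabilizes exp without changing the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    s = alpha @ H                         # weighted sum of the h_it, shape (d,)
    return s, alpha

rng = np.random.default_rng(0)
T, d, a = 7, 100, 100                     # 7 words; 100-dim biGRU output and context vector
H = rng.normal(size=(T, d))
W_w = rng.normal(scale=0.1, size=(a, d))
b_w = np.zeros(a)
u_w = rng.normal(size=a)
s, alpha = word_attention(H, W_w, b_w, u_w)
print(s.shape, alpha.sum())
```

The weights `alpha` are what the paper later visualizes: a larger `alpha[t]` means word `t` contributes more to the sentence vector.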

Sentence Encoder

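Same form as the word encoder, except that the bidirectional GRU runs over the sentence vectors $s_i$ instead of word embeddings, so the details are skipped here.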

Sentence Attention

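Skipped as well: it mirrors word attention, with a sentence-level context vector $u_s$ weighting the encoded sentences to produce the document vector $v$.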
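Since the sentence-level steps are not written out, here is a rough sketch of how the two attention levels stack: attention pools words into sentence vectors, then pools sentence vectors into a document vector. Both encoders are left out to keep it short, and the function names, dimensions and random parameters are assumptions for illustration only.

```python
import numpy as np

def attention_pool(H, W, b, u):
    """Attention-weighted sum of the rows of H, using context vector u."""
    scores = np.tanh(H @ W.T + b) @ u
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ H, weights

def document_vector(sentences, word_params, sent_params):
    """sentences: list of (T_i, d) arrays of already-encoded words (encoders omitted)."""
    sentence_vecs = np.stack([attention_pool(H, *word_params)[0] for H in sentences])
    # In the paper a second biGRU encodes sentence_vecs before this step; skipped in this sketch.
    return attention_pool(sentence_vecs, *sent_params)

def init_params(d, a, rng):
    """Hypothetical helper returning (W, b, u) for one attention level."""
    return (rng.normal(scale=0.1, size=(a, d)), np.zeros(a), rng.normal(size=a))

rng = np.random.default_rng(0)
d, a = 10, 8
doc = [rng.normal(size=(T_i, d)) for T_i in (4, 9, 6)]   # three sentences of different lengths
v, sent_weights = document_vector(doc, init_params(d, a, rng), init_params(d, a, rng))
print(v.shape, sent_weights.shape)  # (10,) (3,)
```

Word-level and sentence-level attention use separate parameters, so the model can learn one notion of "informative word" and another of "informative sentence".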

Document Classification

Finally, sentence attention yields a document vector, which goes through one linear transformation followed by a softmax; the model is trained with cross-entropy to classify the document.
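A minimal sketch of this last layer, assuming a document vector `v` is already available. The names `W_c` and `b_c`, the five classes and the random values are placeholders for illustration.

```python
import numpy as np

def classify(v, W_c, b_c):
    """Softmax over classes for document vector v."""
    logits = W_c @ v + b_c
    logits = logits - logits.max()        # numerical stability; softmax is unchanged
    exp = np.exp(logits)
    return exp / exp.sum()

def nll_loss(p, label):
    """Negative log-likelihood of the correct label (cross-entropy with a one-hot target)."""
    return -np.log(p[label])

rng = np.random.default_rng(0)
num_classes, d = 5, 100                   # e.g. 5 star ratings, 100-dim document vector
v = rng.normal(size=d)
W_c = rng.normal(scale=0.1, size=(num_classes, d))
b_c = np.zeros(num_classes)
p = classify(v, W_c, b_c)
loss = nll_loss(p, label=3)
print(p.shape, loss > 0)
```

Summing `nll_loss` over all training documents gives the training objective.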

Experiments

Datasets

The model is evaluated on several datasets, which fall into two task types: sentiment estimation and topic classification

80% training, 10% validation, 10% testing

Yelp reviews (2013, 2014 and 2015, one dataset per year)
IMDB reviews
Yahoo answers
Amazon reviews
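The 80/10/10 split mentioned above could be reproduced roughly like this. The random shuffle and the fixed seed are assumptions; the post only gives the proportions.

```python
import random

def split_80_10_10(items, seed=0):
    """Shuffle, then cut into 80% train, 10% validation and the remainder as test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

train, val, test_set = split_80_10_10(range(1000))
print(len(train), len(val), len(test_set))  # 800 100 100
```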

Model configuration and training

Sentences and words are split with Stanford's CoreNLP (Manning et al., 2014)
word2vec for word embeddings, dimension=200
words with frequency below 5 are replaced with a special UNK token
GRU dimension=50
- so the concatenated biGRU output has dimension 100
word/sentence context vector dimension=100
batch size=64
optimizer: SGD
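The rare-word handling in the list above can be sketched as below, assuming rare words are mapped to a single unknown-word token. The spelling `UNK`, the helper names and the exact boundary (`>= min_count` keeps a word) are assumptions for illustration.

```python
from collections import Counter

UNK = "UNK"  # placeholder spelling for the unknown-word token

def build_vocab(tokenized_docs, min_count=5):
    """Keep only words that occur at least min_count times across the corpus."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {tok for tok, c in counts.items() if c >= min_count}

def replace_rare(tokens, vocab):
    """Map every out-of-vocabulary token to UNK."""
    return [tok if tok in vocab else UNK for tok in tokens]

docs = [["good", "food"]] * 5 + [["good", "service"]]   # toy corpus
vocab = build_vocab(docs)
print(replace_rare(["good", "food", "service"], vocab))  # ['good', 'food', 'UNK']
```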

Results and analysis

The experimental results on all data sets are shown in Table 2. We refer to our models as HN-{AVE, MAX, ATT}. Here HN stands for Hierarchical Network, AVE indicates averaging, MAX indicates max-pooling, and ATT indicates our proposed hierarchical attention model. Results show that HN-ATT gives the best performance across all data sets.

The paper also compares averaging, max-pooling and attention within the same hierarchical architecture. In short:

HAN comes out on top

Context dependent attention weights

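The figure below shows how much attention weight the word 'good' receives at each rating; panels (b)-(f) correspond to ratings 1 through 5. The higher the rating, the larger the weight on 'good'.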

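Because a word's importance depends heavily on context, a model without attention may end up judging each word in isolation. But 'good' still shows up in low-rated reviews (for example when the reviewer liked only part of the product, or used a negation such as 'not good'), and such models handle those cases poorly. As the figure below shows, HAN does not have this problem.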

Next, the same plot for 'bad':