# [Paper Quick Read] Hierarchical Attention Networks for Document Classification

Posted by John on 2020-05-19
## Abstract

We propose a hierarchical attention network for document classification. Our model has two distinctive characteristics:
(i) it has a hierarchical structure that mirrors the hierarchical structure of documents;
(ii) it has two levels of attention mechanisms applied at the word and sentence-level, enabling it to attend differentially to more and less important content when constructing the document representation.

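1. It uses a hierarchical structure that mirrors the structure of documents.
2. It applies an attention mechanism at both the word level and the sentence level, so that the more important content gets more weight when the representation is built.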

## Introduction

Although neural-network–based approaches to text classification have been quite effective (Kim, 2014; Zhang et al., 2015; Johnson and Zhang, 2014; Tang et al., 2015), in this paper we test the hypothesis that better representations can be obtained by incorporating knowledge of document structure in the model architecture. The intuition underlying our model is that not all parts of a document are equally relevant for answering a query and that determining the relevant sections involves modeling the interactions of the words, not just their presence in isolation.

First, since documents have a hierarchical structure (words form sentences, sentences form a document), we likewise construct a document representation by first building representations of sentences and then aggregating those into a document representation. Second, it is observed that different words and sentences in a document are differentially informative. Moreover, the importance of words and sentences is highly context dependent.

1. Different words and sentences carry different amounts of information within a document.
2. The importance of a word or sentence depends heavily on its context.

## Hierarchical Attention Networks

HAN consists of the following components:

• word sequence encoder
• word-level attention layer
• sentence encoder
• sentence-level attention layer
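
To make the bottom-up flow of these four components concrete, here is a minimal structural sketch in plain Python. It is not the paper's implementation: the encoders are identity placeholders and `mean_pool` stands in for the attention layers, and all function names are hypothetical. It only shows how word vectors become sentence vectors, which then become one document vector.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Stand-in for an attention layer: plain average of the input vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def identity_encoder(sequence: Sequence[Vector]) -> List[Vector]:
    """Placeholder for a (bidirectional GRU) sequence encoder."""
    return [list(v) for v in sequence]


def encode_document(
    document: Sequence[Sequence[Vector]],
    encode_words: Callable[[Sequence[Vector]], List[Vector]],
    pool_words: Callable[[Sequence[Vector]], Vector],
    encode_sentences: Callable[[Sequence[Vector]], List[Vector]],
    pool_sentences: Callable[[Sequence[Vector]], Vector],
) -> Vector:
    """Build a document vector bottom-up: words -> sentences -> document."""
    sentence_vectors = []
    for sentence in document:
        word_states = encode_words(sentence)              # word sequence encoder
        sentence_vectors.append(pool_words(word_states))  # word-level attention
    sentence_states = encode_sentences(sentence_vectors)  # sentence encoder
    return pool_sentences(sentence_states)                # sentence-level attention


doc = [
    [[1.0, 0.0], [3.0, 2.0]],  # sentence 1: two word embeddings
    [[0.0, 4.0]],              # sentence 2: one word embedding
]
doc_vector = encode_document(
    doc, identity_encoder, mean_pool, identity_encoder, mean_pool
)
print(doc_vector)
```

Swapping the placeholders for real GRU encoders and attention layers keeps the same call structure, which is the point of the hierarchy.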

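The new state of the GRU is a linear interpolation between the previous state and the candidate state: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$, where $z_t$ is the update gate and $\tilde{h}_t$ is the candidate state.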
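
As a rough illustration, a single GRU step can be sketched in plain Python as follows. This assumes the standard GRU formulation with an update gate `z`, a reset gate `r`, and a candidate state; the `params` dictionary and its key names are hypothetical, chosen only for this example.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x: Vector, h_prev: Vector, params: Dict[str, list]) -> Vector:
    """One GRU step; `params` holds hypothetical weights Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh."""
    # Update gate: how much of the new candidate state to take in.
    z = [sigmoid(a + b + c) for a, b, c in zip(
        matvec(params["Wz"], x), matvec(params["Uz"], h_prev), params["bz"])]
    # Reset gate: how much of the past state feeds the candidate state.
    r = [sigmoid(a + b + c) for a, b, c in zip(
        matvec(params["Wr"], x), matvec(params["Ur"], h_prev), params["br"])]
    # Candidate state: the reset gate scales the transformed previous state.
    uh = matvec(params["Uh"], h_prev)
    h_tilde = [math.tanh(a + ri * b + c) for a, ri, b, c in zip(
        matvec(params["Wh"], x), r, uh, params["bh"])]
    # New state: interpolate between the previous state and the candidate.
    return [(1.0 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]


# With all-zero weights both gates are 0.5 and the candidate state is 0,
# so the new state is half of the previous state.
zeros_2x2 = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros_2x2 for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
params.update({name: [0.0, 0.0] for name in ("bz", "br", "bh")})
h_new = gru_step([1.0, 1.0], [1.0, -2.0], params)
print(h_new)
```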

### Hierarchical Attention

#### Word Attention

The word annotation $h_{it}=[\overrightarrow{h_{it}}, \overleftarrow{h_{it}}]$ first goes through a transformation, and attention is then computed against a context vector $u_w$. Here $u_w$ can be seen as a high-level representation of a fixed query, “what is the informative word”, over the words, like that used in memory networks.
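
The word attention step can be sketched as below, assuming the usual form: a one-layer transformation $u_{it} = \tanh(W h_{it} + b)$, a softmax over the dot products $u_{it}^\top u_w$, and a weighted sum of the annotations. The function and parameter names (`word_attention`, `W`, `b`, `u_w`) are illustrative, and in a real model `W`, `b`, and `u_w` would be learned rather than fixed by hand.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_attention(
    h: Sequence[Vector], W: Matrix, b: Vector, u_w: Vector
) -> Tuple[Vector, List[float]]:
    """Return (sentence vector, attention weights) for word annotations h."""
    # Hidden representation of each annotation: u_t = tanh(W h_t + b).
    u = [
        [math.tanh(sum(w * x for w, x in zip(row, h_t)) + b_k)
         for row, b_k in zip(W, b)]
        for h_t in h
    ]
    # Similarity of each word with the context vector, normalized by softmax.
    alpha = softmax([sum(a * c for a, c in zip(u_t, u_w)) for u_t in u])
    # Sentence vector: attention-weighted sum of the original annotations.
    dim = len(h[0])
    s = [sum(a * h_t[k] for a, h_t in zip(alpha, h)) for k in range(dim)]
    return s, alpha


identity = [[1.0, 0.0], [0.0, 1.0]]
annotations = [[2.0, 0.0], [0.0, 2.0]]

# The first word points in the same direction as u_w, so it gets more weight.
s, alpha = word_attention(annotations, identity, [0.0, 0.0], [1.0, 0.0])
print(alpha, s)
```

The sentence-level attention has the same shape: the inputs become sentence annotations and the context vector becomes a sentence-level one.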

(Skipping ahead to the experiments.)

## Experiments

### Datasets

• 80% training, 10% validation, 10% testing
1. Yelp reviews (three years, one dataset per year)
2. IMDB reviews
3. Amazon reviews
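
A minimal sketch of the 80/10/10 split is shown below. The shuffling and the fixed seed are assumptions made for this example; the notes do not say how the split was drawn, and `split_dataset` is a hypothetical helper name.

```python
import random
from typing import Iterable, List, Tuple


def split_dataset(examples: Iterable, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle and split into 80% training, 10% validation, 10% testing."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility (assumption)
    n_train = len(items) * 8 // 10
    n_valid = len(items) // 10
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]    # whatever remains goes to the test set
    return train, valid, test


train, valid, test = split_dataset(range(100))
print(len(train), len(valid), len(test))
```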

### Model configuration and training

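• Sentences and words are split with Stanford’s CoreNLP (Manning et al., 2014)
• Word embeddings are trained with word2vec, dimension=200
• Words with a frequency below 5 are replaced with a special `<UNK>` token
• GRU dimension=50
• so the bidirectional GRU annotation has dimension 100
• word/sentence context vector dimension=100
• batch size=64
• optimizer: SGD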
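
The settings above can be collected into a small configuration, together with a sketch of the rare-word replacement. The key names in `CONFIG`, the helper names, and the exact spelling of the `<UNK>` token are assumptions for illustration; only the numeric values come from the list.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Set

CONFIG = {
    "embedding_dim": 200,   # word2vec embedding size
    "gru_dim": 50,          # per-direction GRU size
    "bigru_dim": 100,       # forward + backward states concatenated
    "context_dim": 100,     # word/sentence context vector size
    "batch_size": 64,
    "optimizer": "SGD",
    "min_word_freq": 5,     # words seen fewer times become UNK
}

UNK = "<UNK>"  # assumed spelling of the unknown-word token


def build_vocab(tokenized_docs: Iterable[Sequence[str]], min_freq: int) -> Set[str]:
    """Keep only the words that occur at least `min_freq` times."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {tok for tok, count in counts.items() if count >= min_freq}


def replace_rare(tokens: Sequence[str], vocab: Set[str]) -> List[str]:
    """Map every out-of-vocabulary token to UNK."""
    return [tok if tok in vocab else UNK for tok in tokens]


docs = [["good"] * 5 + ["rare"]]
vocab = build_vocab(docs, CONFIG["min_word_freq"])
print(replace_rare(["good", "rare"], vocab))
```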

### Results and analysis

The experimental results on all data sets are shown in Table 2. We refer to our models as HN-{AVE, MAX, ATT}. Here HN stands for Hierarchical Network, AVE indicates averaging, MAX indicates max-pooling, and ATT indicates our proposed hierarchical attention model. Results show that HN-ATT gives the best performance across all data sets.
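
The three variants differ in how a sequence of vectors is collapsed into one. The sketch below contrasts the three pooling operations on a toy input. It assumes the attention weights are already computed and sum to 1, and the function names are hypothetical; it illustrates the operations only and does not reproduce the reported results.

```python
from typing import List, Sequence

Vector = List[float]


def average_pool(vectors: Sequence[Vector]) -> Vector:
    """AVE: every vector contributes equally."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def max_pool(vectors: Sequence[Vector]) -> Vector:
    """MAX: keep the largest value in each dimension."""
    return [max(column) for column in zip(*vectors)]


def attention_pool(vectors: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """ATT: weighted sum, with weights assumed to be given and to sum to 1."""
    return [sum(w * v for w, v in zip(weights, column)) for column in zip(*vectors)]


states = [[1.0, 4.0], [3.0, 0.0]]
print(average_pool(states))
print(max_pool(states))
print(attention_pool(states, [0.25, 0.75]))
```

With uniform weights `attention_pool` reduces to `average_pool`, so attention can be read as a learned, input-dependent generalization of averaging.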

HAN is awesome!

