[Paper Speed-Read] Attention is not Explanation

〖For more paper walkthroughs in Chinese, head to the [Paper Speed-Read] series introduction, which lists every post published so far!〗

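A NAACL 2019 paper that uses a series of experiments to argue a new position: attention is not explanation, and we can't claim attention as an XAI technique. That's a pretty unusual stance. Ever since attention took off around 2016, a pile of research has been built on top of it, and now this paper comes along and slaps down attention's interpretability.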

…So does that mean we can't use attention for XAI (Explainable AI) anymore? QQ

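Don't get sad too early, though. That same year another paper popped up, "Attention is not not Explanation" at EMNLP 2019. The title alone tells you it's picking a fight with this one, so reading this paper first and then seeing how it gets rebutted sounds like fun, right?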

As an aside, I'm on Team Attention is not not Explanation, and I hope that paper slaps this one down hard xD

Paper link: https://arxiv.org/pdf/1902.10186.pdf

Abstract

In addition to improving predictive performance, these are often touted as affording transparency: models equipped with attention provide a distribution over attended-to input units, and this is often presented (at least implicitly) as communicating the relative importance of inputs. However, it is unclear what relationship exists between attention weights and model outputs.
In this work we perform extensive experiments across a variety of NLP tasks that aim to assess the degree to which attention weights provide meaningful “explanations” for predictions

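Attention has been shown to improve performance, but plenty of people also hype it as providing transparency, as something that helps explain the black box that is deep learning.

The authors argue that the relationship between attention weights and model outputs is actually still unclear, so you don't get to say that!

So they designed a series of NLP experiments to assess how much explanatory value attention weights provide for a model's predictions, and they found: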

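Attention weights are often only weakly correlated with gradient-based measures of feature importance
Different attention distributions can yield the same prediction (if attention really indicates what matters to the model, why would different importance distributions give the same result?)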

So they conclude that attention cannot and should not be treated as a method that provides explanations (though they are not rejecting attention itself).

The code is on GitHub: https://github.com/successar/AttentionExplanation

Introduction and Motivation

Li et al. (2016) summarized this commonly held view in NLP: “Attention provides an important way to explain the workings of neural models”. Indeed, claims that attention provides interpretability are common in the literature

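After 2016, attention was widely regarded as a mechanism that can help explain how a model works. This consensus rests on one assumption:

Features with high attention weights influence the model's output more strongly, so we can call them the important features for the task

But this assumption had never actually been formally evaluated. Hence this paper.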

Assuming attention provides a faithful explanation for model predictions, we might expect the following properties to hold.

Attention weights should correlate with feature importance measures (e.g., gradient-based measures);

Alternative (or counterfactual) attention weight configurations ought to yield corresponding changes in prediction (and if they do not then are equally plausible as explanations).

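If attention provides a faithful explanation of the model's predictions, they expect the following properties to hold as well:

Attention weights should correlate with feature importance measures (for example, gradient-based ones)
Different attention weight distributions should lead to different predictions, because the model is attending to different places. Conversely, if the prediction does not change, the alternative weights are just as plausible an explanation as the original ones

Then they say they ran experiments on many NLP tasks and found neither of these properties holding :<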

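Next they use an example to illustrate the second point:
A standard BiLSTM with attention is used for sentiment analysis. Under the model's original attention distribution $\alpha$, the tokens asking and waste come out as the important ones behind the negative prediction ($y=0.01$).

They found that these attention weights correlate only very weakly with gradient-based measures of feature importance
They also constructed another distribution $\tilde{\alpha}$ (where the important tokens become myself and was) that still yields the same prediction. Swapping in other distributions barely moves the prediction either (a difference of only 0.006)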

We thus caution against using attention weights to highlight input tokens “responsible for” model outputs and constructing just-so stories on this basis.

Based on these findings, they warn everyone to stop treating attention as an explanation.

Research questions and contributions

We investigate whether this holds across tasks by exploring the following empirical questions.

To what extent do induced attention weights correlate with measures of feature importance – specifically, those resulting from gradients and leave-one-out (LOO) methods?

Would alternative attention weights (and hence distinct heatmaps/“explanations”) necessarily yield different predictions?

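The authors run experiments on multiple NLP tasks built around the questions above:

How well do attention weights correlate with the feature importance obtained from gradient-based and LOO methods?
Do different attention weights necessarily lead to different predictions?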

Datasets and Tasks

The two questions above are checked on multiple datasets across the following three kinds of tasks

binary text classification
Question Answering (QA)
Natural Language Inference (NLI)

Experiments

The detailed experimental results are available on this site: Attention is not Explanation — Results

Correlation Between Attention and Feature Importance Measures

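Attention weights are compared against two feature importance measures, (1) gradient-based and (2) leave-one-out (LOO); the algorithm is shown in the figure below

Gradient-based: run backward and read the importance off the gradient
LOO: remove a token $t$ and measure the TVD between the original prediction and the prediction without $t$; that change is then compared against the attention weights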
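To make the LOO measure concrete, here is a minimal sketch in plain Python. The `toy_predict` function and its per-token scores are made up for illustration and stand in for the real model; only the TVD and leave-one-out logic reflect what the post describes.

```python
import math

def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def toy_predict(tokens):
    """Hypothetical stand-in for the model: returns [P(negative), P(positive)]."""
    weights = {"waste": -2.0, "great": 2.0}  # made-up per-token scores
    score = sum(weights.get(t, 0.0) for t in tokens)
    p_pos = 1.0 / (1.0 + math.exp(-score))
    return [1.0 - p_pos, p_pos]

def loo_importance(tokens, predict):
    """For each position t: TVD between the full prediction and the prediction without token t."""
    full = predict(tokens)
    return [tvd(predict(tokens[:t] + tokens[t + 1:]), full)
            for t in range(len(tokens))]

tokens = ["what", "a", "waste", "of", "time"]
scores = loo_importance(tokens, toy_predict)
```

In this toy setup only removing "waste" changes the prediction, so it is the only position with a nonzero LOO score. The resulting list is what would be rank-correlated with the attention weights.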

Note that we disconnect the computation graph at the attention module so that the gradient does not flow through this layer: This means the gradients tell us how the prediction changes as a function of inputs, keeping the attention distribution fixed.

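One thing worth noting is that they disconnect the computation graph at the attention layer, so during the backward pass the gradient does not flow through this layer

In other words, the forward pass runs as usual, but during the backward pass the attention weights are treated as constants: the gradient reaches the inputs through the rest of the network, and the attention layer itself takes no part in the backward computation
The authors say the point is to see how the prediction changes with the inputs while the attention distribution is held fixed, and then correlate that gradient-based importance with the attention weights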
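Here is a small sketch of what "gradient with the attention held fixed" means, in plain Python. Everything in it is an assumption for illustration: a toy model with no recurrent encoder, where $y = \sigma(w \cdot \sum_t \alpha_t e_t)$ and $\alpha = \mathrm{softmax}(v \cdot e_t)$, and made-up vectors `embs`, `v`, `w`. The per-token importance uses one common choice, the gradient with respect to the one-hot entry of the token actually present, which for this toy model reduces to $|\partial y / \partial e_t \cdot e_t|$.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def output(embs, alpha, w):
    """y = sigmoid(w . sum_t alpha_t * e_t), with alpha passed in explicitly."""
    context = [sum(a * e[d] for a, e in zip(alpha, embs)) for d in range(len(w))]
    return 1.0 / (1.0 + math.exp(-dot(w, context)))

def grad_importance_fixed_attention(embs, v, w):
    # Forward pass as usual: attention is computed from the inputs.
    alpha = softmax([dot(v, e) for e in embs])
    y = output(embs, alpha, w)
    # Backward pass with alpha treated as a constant:
    # dy/de_t = y * (1 - y) * alpha_t * w, with no term for d(alpha)/d(e_t).
    # Importance of token t: |dy/de_t . e_t|.
    g = [abs(y * (1.0 - y) * a * dot(w, e)) for a, e in zip(alpha, embs)]
    return alpha, g

embs = [[1.0, 0.0], [0.5, -1.0], [-0.3, 0.8]]  # made-up token embeddings
v = [0.2, -0.1]                                # made-up attention vector
w = [1.5, -0.7]                                # made-up output weights
alpha, g = grad_importance_fixed_attention(embs, v, w)
```

In an autodiff framework the same effect usually comes from detaching or stop-gradient-ing the attention weights before they are used. The hand-written derivative above just makes visible which term gets dropped.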

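The detailed results are shown in the table below:

There are three columns:

Correlation between the gradient-based measure and attention weights under BiLSTM
- BiRNN + attention
Correlation between the gradient-based measure and attention weights under Average
- tokens encoded by a linear projection layer (followed by ReLU) + attention
Correlation between the LOO measure and attention weights under BiLSTM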

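As you can see, the correlations under BiLSTM are mostly below 0.5, except for a few datasets (Diabetes and some of the QA corpora) where the correlation is relatively higher. The authors attribute this to those datasets providing enough samples, which lets the correlation between gradient-based importance and attention come out a bit better (though still weaker than with the other methods)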
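The correlations in this section are Kendall $\tau$ rank correlations. As a rough sketch of what that number measures, here is a plain-Python tau-a, where tied pairs count as neither concordant nor discordant. The `attention` and `importance` values are made up. Library implementations such as `scipy.stats.kendalltau` default to the tau-b variant, which handles ties differently, so values can differ slightly when ties are present.

```python
def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1   # both rankings order the pair the same way
            elif s < 0:
                discordant += 1   # the rankings disagree on this pair
    return (concordant - discordant) / (n * (n - 1) / 2)

attention = [0.05, 0.10, 0.60, 0.15, 0.10]    # made-up attention weights
importance = [0.02, 0.30, 0.25, 0.01, 0.20]   # made-up gradient or LOO scores
tau = kendall_tau(attention, importance)
```

A value near 1 means the two measures rank the tokens almost identically, a value near 0 means the rankings are basically unrelated, and negative values mean they tend to disagree.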

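Next, the numbers show that a model using a simple linear projection embedding actually correlates more strongly with the gradient-based measure. Their hypothesis is that the BiRNN reads the entire input before producing each hidden state, so every hidden state carries information about the whole sequence.

They then analyze the correlation between gradient-based and LOO (no figures here; see Fig. 3 to Fig. 5 in the paper) and find that gradient-based and LOO do correlate with each other more strongly than attention does with either

Go read the paper for the details, but this section mainly answers the authors' first question: the correlation between attention and gradient-based importance is actually not high. On top of that, a plain embedding + attention setup ends up with a higher correlation than a BiRNN. These numbers make you rethink whether attention is really interpretable.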

Counterfactual Attention Weights

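The next thing they want to know: can different attention distributions produce the same prediction?

Because under the "attention is explanation" view, the authors argue, different distributions should lead to different model outputs.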

We experiment with two means of constructing such distributions.

First, we simply scramble the original attention weights $\hat{\alpha}$, re-assigning each value to an arbitrary, randomly sampled index (input feature).

Second, we generate an adversarial attention distribution: this is a set of attention weights that is maximally distinct from $\hat{\alpha}$ but that nonetheless yields an equivalent prediction (i.e., prediction within some $\epsilon$ of $\hat{y}$)

The first test simply shuffles the order of the attention weights; the paper lists the procedure as an algorithm.
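The shuffling test can be sketched roughly like this in plain Python. It is an illustration under assumptions, not the authors' code: the decoder is a toy $\sigma(w \cdot \sum_t \alpha_t h_t)$, the `hidden`, `w`, and `alpha` values are made up, and the default of 100 shuffles with the median as summary is just an illustrative choice.

```python
import math
import random
import statistics

def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def predict(alpha, hidden, w):
    """Toy decoder: attention-weighted sum of hidden states, then a sigmoid."""
    context = [sum(a * h[d] for a, h in zip(alpha, hidden)) for d in range(len(w))]
    score = sum(wi * ci for wi, ci in zip(w, context))
    p = 1.0 / (1.0 + math.exp(-score))
    return [1.0 - p, p]

def median_shift_under_permutation(alpha, hidden, w, n_shuffles=100, seed=0):
    """Shuffle the attention weights repeatedly and report the median output change."""
    rng = random.Random(seed)
    base = predict(alpha, hidden, w)
    shifts = []
    for _ in range(n_shuffles):
        shuffled = alpha[:]        # same weights, reassigned to random positions
        rng.shuffle(shuffled)
        shifts.append(tvd(predict(shuffled, hidden, w), base))
    return statistics.median(shifts)

hidden = [[0.2, -0.1], [1.5, 0.3], [-0.4, 0.9], [0.1, 0.1]]  # made-up hidden states
w = [1.0, -2.0]                                              # made-up output weights
alpha = [0.1, 0.6, 0.2, 0.1]                                 # made-up attention weights
shift = median_shift_under_permutation(alpha, hidden, w)
```

If the median shift stays small even though the shuffled weights point at completely different tokens, the original heatmap is hard to defend as the explanation for that prediction.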

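The second generates a set of adversarial attention distributions $\alpha^{(i)}$ (the distributions that differ most from the original one while still producing the same prediction), mainly by solving an optimization problem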

$\begin{array}{ll} \underset{\alpha^{(1)}, \ldots, \alpha^{(k)}}{\operatorname{maximize}} & f\left(\left\{\alpha^{(i)}\right\}_{i=1}^{k}\right) \\ \text { subject to } & \forall i \operatorname{TVD}\left[\hat{y}\left(\mathbf{x}, \alpha^{(i)}\right), \hat{y}(\mathbf{x}, \hat{\alpha})\right] \leq \epsilon \end{array}$

where $f\left(\left\{\alpha^{(i)}\right\}_{i=1}^{k}\right)$ is:

$\sum_{i=1}^{k} \operatorname{JSD}\left[\alpha^{(i)}, \hat{\alpha}\right]+\frac{1}{k(k-1)} \sum_{i<j} \operatorname{JSD}\left[\alpha^{(i)}, \alpha^{(j)}\right]$
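To see what this objective rewards, here is a small plain-Python sketch that only evaluates $f$ for a given set of candidate distributions; it does not solve the optimization or check the TVD constraint. It assumes the second sum runs over pairs $i<j$, and the `alpha_hat` and `candidates` values are made up.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def objective(candidates, alpha_hat):
    """f = sum_i JSD(alpha_i, alpha_hat) + 1/(k(k-1)) * sum_{i<j} JSD(alpha_i, alpha_j)."""
    k = len(candidates)
    to_original = sum(jsd(a, alpha_hat) for a in candidates)
    pairwise = sum(jsd(candidates[i], candidates[j])
                   for i in range(k) for j in range(i + 1, k))
    spread = pairwise / (k * (k - 1)) if k > 1 else 0.0
    return to_original + spread

alpha_hat = [0.7, 0.2, 0.1]                        # made-up original attention
candidates = [[0.1, 0.2, 0.7], [0.2, 0.7, 0.1]]    # made-up adversarial candidates
value = objective(candidates, alpha_hat)
```

The first term pushes every candidate away from the original attention, and the second pushes the candidates away from each other, so the search ends up with several distinct "explanations" rather than $k$ copies of the same one.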

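I won't go through the details of the two sub-experiments, go read them yourself, but roughly speaking the authors find that different attention distributions can produce the same prediction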

Discussion and Conclusions

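The authors use a series of experiments to check the two questions they raised:

The correlation between attention and gradient-based importance is actually not high
Different attention distributions can give the same prediction

From these two results they conclude that although attention helps model performance, the mechanism behind it is still unclear, and we can't treat it as an Explainable AI technique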

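Also, this paper has some important limitations

Comparing correlations against gradient-based methods effectively treats gradient-based importance as ground truth, but gradients don't fully represent true interpretability either
How high does the correlation between attention and gradient-based importance need to be before it counts as good? Nobody really knows
The evaluation uses the Kendall $\tau$ measure, which gives noisy, imprecise estimates when there are irrelevant features
In practice the paper only experiments with a small number of attention variants (mostly with a BiLSTM encoder)
Even though different attention distributions can produce the same prediction, the authors don't rule out that multiple explanations can coexist, meaning the model may be able to reach the same inference through different combinations of attention.
Finally, the experiments only cover classification-style tasks so far, and other tasks are left as future work (ㆆᴗㆆ)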

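Conclusion

The paper is long, but it basically runs experiments like crazy to check two questions:

Do attention weights correlate with feature importance measures (for example, gradient-based ones)? If they do, that supports attention being explanatory
Different attention weight distributions should lead to different predictions, because the model is attending to different places; conversely, if the prediction does not change, the alternative weights are just as plausible an explanation as the original ones

If these two properties don't hold, the authors argue that attention can't be used as the basis for an explanation technique.

That said, "Attention is not not Explanation" later raised some objections to both the arguments and the experimental methods in this paper, arguing that they aren't enough to claim attention has no explanatory power (note my wording: they did not prove that attention is explanatory, they argue you can't say it isn't… and no, I'm not doing a tongue twister here).

Once you've read both papers, which side are you on~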