[Paper Speed-Read] Attentive CutMix: An Enhanced Data Augmentation Approach for Deep Learning Based Image Classification

〖Want more Chinese-language paper walkthroughs? The [Paper Speed-Read] series introduction lists every post published so far!〗

Paper link: https://arxiv.org/pdf/2003.13048.pdf

Abstract

However, all of them perform this operation randomly, without capturing the most important region(s) within an object. In this paper, we propose Attentive CutMix, a naturally enhanced augmentation strategy based on CutMix [3]. In each training iteration, we choose the most descriptive regions based on the intermediate attention maps from a feature extractor, which enables searching for the most discriminative parts in an image.

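An evolved version of CutMix. Earlier data augmentation methods of this kind all pick their regions at random. Attentive CutMix instead takes the attention map from an intermediate layer of a feature extractor and uses it to choose the most descriptive regions for CutMix.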

Introduction

Attentive CutMix builds on CutMix and tries to find the most representative regions to swap in.

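Warning: the figure below is intense. Non-combatants, please evacuate immediately.

At this point nobody is thinking about the animals' feelings anymore… someone please calculate the size of that dog's psychological trauma…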

Our goal is to learn a more robust network that can attend to the most important part(s) of an object with better recognition performance without incurring any additional testing costs. We achieve this by utilizing the attention maps generated from a pretrained network to guide the localization operation of cutting and pasting among training image pairs in CutMix.

wherein we initially discern the most important parts from an object, then use cut and paste inspired from CutMix to generate a new image which helps the networks better attend to the local regions of an image.

The attention maps come from a pretrained network, so the method adds no extra cost to the model being trained itself (in particular, nothing extra at test time).

That said, the experiments covered in this post are only on CIFAR-10 and CIFAR-100.

Proposed Approach

Algorithm

The regularization itself works exactly as in CutMix. For details, see [Paper Speed-Read] CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features.
$x\in R^{W\times H\times C}$ is an image and $y$ is its label. CutMix combines two samples $(x_A, y_A)$ and $(x_B, y_B)$ into a new sample $(\tilde{x}, \tilde{y})$ using the following formulas:

$\begin{aligned}\tilde{x}&=M\odot x_A+(1-M)\odot{x_B}\\\tilde{y}&=\lambda y_A+(1-\lambda)y_B\end{aligned}$

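- $M\in \{0,1\}^{W\times H}$ is a binary mask: its value is 0 inside the bounding box and 1 everywhere else
- $\odot$ is element-wise multiplication
- The combination ratio $\lambda$ is drawn from a beta distribution $Beta(\alpha, \alpha)$
  - Sampling $\lambda$ this way comes from the original Mixup paper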
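
To make the two formulas concrete, here is a minimal NumPy sketch. The function name `cutmix_combine`, the (H, W, C) array layout, and the one-hot labels are assumptions made for illustration. As a simplification, it takes a ready-made mask and reads $\lambda$ off as the fraction of pixels kept from $x_A$, instead of sampling $\lambda$ from $Beta(\alpha, \alpha)$ and deriving the box from it.

```python
import numpy as np

def cutmix_combine(x_a, y_a, x_b, y_b, mask):
    """Mix two samples with a binary mask (hypothetical helper).

    x_a, x_b: images of shape (H, W, C) -- assumed layout.
    y_a, y_b: one-hot label vectors.
    mask: array of shape (H, W), 0 inside the box and 1 elsewhere.
    """
    m = mask[..., None].astype(x_a.dtype)   # add a channel axis so it broadcasts
    x_mix = m * x_a + (1.0 - m) * x_b       # x~ = M * x_A + (1 - M) * x_B
    lam = float(mask.mean())                # simplification: lambda = area kept from x_A
    y_mix = lam * y_a + (1.0 - lam) * y_b   # y~ = lambda * y_A + (1 - lambda) * y_B
    return x_mix, y_mix, lam

# Toy usage: a 4x4 image pair where the top-left 2x2 box is taken from x_b.
x_a, x_b = np.ones((4, 4, 3)), np.zeros((4, 4, 3))
mask = np.ones((4, 4))
mask[:2, :2] = 0
x_mix, y_mix, lam = cutmix_combine(
    x_a, np.array([1.0, 0.0]), x_b, np.array([0.0, 1.0]), mask
)
```

In this toy case 12 of the 16 pixels still come from `x_a`, so `lam` is 0.75 and the mixed label is 0.75 for class A and 0.25 for class B.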

The key question is how the patches are chosen. In CutMix they are sampled from a uniform distribution, but this paper uses an attention mechanism instead:

We first obtain a heatmap (generally a 7×7 grid map) of the first image by passing it through a pretrained classification model like ResNet-50 and take out the final 7×7 output feature map. We then select the top “N” patches from this 7×7 grid as our attentive region patches to cut from the given image. Here N can range from 1 to 49 (i.e. the entire image itself). Later, we will present an ablation study on the number of attentive patches to be cut from a given image.

Feed an image into a pretrained model, take the final 7x7 feature map, and pick the top N patches from it. The paper notes that N can be anywhere from 1 to 49.
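
The selection step can be sketched roughly as follows. This post does not spell out how the multi-channel feature map is reduced to a single 7x7 heatmap, so the channel-wise mean used here is an assumption, as are the function name `top_n_cells` and the (C, 7, 7) layout. A random array stands in for the real output of a pretrained network such as ResNet-50.

```python
import numpy as np

def top_n_cells(feature_map, n):
    """Return (row, col) grid indices of the n highest-scoring cells.

    feature_map: array of shape (C, 7, 7) -- assumed layout of the final
    feature map from a pretrained classifier.
    """
    heatmap = feature_map.mean(axis=0)                   # ASSUMPTION: channel mean as attention score
    order = np.argsort(heatmap, axis=None)[::-1][:n]     # flat indices, largest score first
    rows, cols = np.unravel_index(order, heatmap.shape)  # back to 2-D grid coordinates
    return list(zip(rows.tolist(), cols.tolist()))

# Stand-in for a real feature map (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
fake_features = rng.random((2048, 7, 7))
cells = top_n_cells(fake_features, n=6)   # six (row, col) pairs
```

Each returned pair identifies one cell of the 7x7 grid; mapping those cells back to pixel regions is a separate step.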

We then map the selected attentive patches back to the original image. For example, a single patch in a 7×7 grid would map back to a 32×32 image patch on a 224×224 size input image. The patches are cut from the first image and pasted onto the second image at their respective original locations, assuming both images are of the same size. The pair of training samples are randomly selected during each training phase. For the composite label, considering that we pick the top 6 attentive patches from a 7×7 grid, λ would then be 6/49. Every image in the training batch is augmented with patches cutout from another randomly selected image in the original batch. Please refer Fig. 2 for an illustrative representation of our method.

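Map the attentive patches back onto the original image, then paste them over the same locations in the second image (assuming both images are the same size).

Ex: for a 224x224 image, each cell of the 7x7 grid maps back to a 32x32 patch.

The mixed label depends on how many patches are used. For example, N=6 gives $\lambda=6/49$.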
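
Here is a rough NumPy sketch of the cut-and-paste step and the label mixing. The function name `attentive_cutmix`, the (H, W, C) layout, and the one-hot labels are assumptions for illustration. The sketch also assumes that $\lambda = N/49$ weights the label of the image the patches were cut from, by analogy with the area-proportional weighting in CutMix.

```python
import numpy as np

def attentive_cutmix(x_src, y_src, x_dst, y_dst, cells, grid=7):
    """Paste the selected grid cells of x_src onto x_dst at the same locations.

    x_src, x_dst: same-size images of shape (H, W, C), H and W divisible by grid.
    cells: list of (row, col) grid indices of the attentive patches.
    """
    h, w = x_src.shape[:2]
    ph, pw = h // grid, w // grid              # 224 // 7 = 32 pixels per cell
    x_mix = x_dst.copy()
    for r, c in cells:
        rows = slice(r * ph, (r + 1) * ph)
        cols = slice(c * pw, (c + 1) * pw)
        x_mix[rows, cols] = x_src[rows, cols]  # same location in both images
    lam = len(cells) / (grid * grid)           # e.g. 6 / 49 for N = 6
    # ASSUMPTION: lam weights the label of the image the patches came from.
    y_mix = lam * y_src + (1.0 - lam) * y_dst
    return x_mix, y_mix, lam

# Toy usage: paste six cells from an all-ones image onto an all-zeros image.
src, dst = np.ones((224, 224, 3)), np.zeros((224, 224, 3))
picked = [(0, 0), (1, 1), (2, 2), (3, 3), (3, 4), (6, 6)]
x_mix, y_mix, lam = attentive_cutmix(
    src, np.array([1.0, 0.0]), dst, np.array([0.0, 1.0]), picked
)
```

With six cells selected, `lam` is 6/49 (about 0.12), so the pasted image contributes only a small share of the mixed label even though its most attended regions are the ones that were copied.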

Theoretical Improvements over CutMix

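This section points out that although CutMix works well, it has no solid theoretical justification. One hypothesis is that randomly pasting over a patch may cover an important region of the image, which reduces the model's overfitting to one specific subject.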

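Under that hypothesis, CutMix's random placement no longer makes much sense: covering an unimportant location (for example, when the patch lands on a background region) does not help make the model more robust. This is why Attentive CutMix was proposed: the goal is for every covered region to be an important one.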

In addition, how well Attentive CutMix works depends to a large extent on whether the pretrained model can actually locate the important parts.

Experiments and Analysis

Tested on CIFAR-10 and CIFAR-100, and Attentive CutMix does great.

Ablation Study

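They also ran an experiment on what N should be (N = 1 to 15) and concluded that N=6 gives the best results:

- Too few patches (fewer than 6) may leave the important regions only partly covered
- Too many patches cover too much of the image, so the model cannot learn effectively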

Discussion

The method needs an extra pretrained model, but the paper argues that this cost is no big deal compared with the performance gain~