# [論文速速讀]Attentive CutMix: An Enhanced Data Augmentation Approach for Deep Learning Based Image Classification

Posted by John on 2020-04-26

〖For more Chinese paper walkthroughs, see the [論文速速讀] series introduction for a list of all published posts!〗

## Abstract

However, all of them perform this operation randomly, without capturing the most important region(s) within an object. In this paper, we propose Attentive CutMix, a naturally enhanced augmentation strategy based on CutMix [3]. In each training iteration, we choose the most descriptive regions based on the intermediate attention maps from a feature extractor, which enables searching for the most discriminative parts in an image.

An enhanced version of CutMix. Previous data augmentation methods of this kind perform the operation randomly; Attentive CutMix instead extracts an intermediate-layer attention map and uses it to pick the most descriptive regions for the CutMix operation.

## Introduction

Attentive CutMix builds on CutMix, but seeks out the most representative regions to cut and paste rather than choosing them at random.

Our goal is to learn a more robust network that can attend to the most important part(s) of an object with better recognition performance without incurring any additional testing costs. We achieve this by utilizing the attention maps generated from a pretrained network to guide the localization operation of cutting and pasting among training image pairs in CutMix.

…wherein we initially discern the most important parts from an object, then use cut and paste inspired from CutMix to generate a new image which helps the networks better attend to the local regions of an image.

## Proposed Approach

### Algorithm

The regularization itself is the same as in CutMix; for details see [論文速速讀] CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features.

Let $x\in \mathbb{R}^{W\times H\times C}$ be an image and $y$ its label. CutMix combines two samples $(x_A, y_A)$ and $(x_B, y_B)$ into a new sample $(\tilde{x}, \tilde{y})$ via:

$$\tilde{x} = M \odot x_A + (1 - M) \odot x_B$$
$$\tilde{y} = \lambda y_A + (1 - \lambda) y_B$$

• $M\in \{0,1\}^{W\times H}$ is a binary mask whose value is 0 inside the bounding box and 1 elsewhere
• $\odot$ denotes element-wise multiplication
• the combination ratio $\lambda$ is sampled from a beta distribution $Beta(\alpha, \alpha)$
• the sampling of $\lambda$ follows the original Mixup paper
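The mixing formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: `cutmix_combine` is a hypothetical helper, and it assumes the binary mask has already been built; $\lambda$ is recovered here from the mask area, which matches the sampled $\lambda$ up to rounding.

```python
import numpy as np

def cutmix_combine(x_a, y_a, x_b, y_b, mask):
    """Combine two labelled images with a binary mask M.

    mask: (W, H) array with 0 inside the bounding box and 1 elsewhere,
    so the box region of the output is filled with pixels from x_b.
    """
    m = mask[..., None]                   # broadcast the mask over channels
    x_tilde = m * x_a + (1 - m) * x_b
    lam = mask.mean()                     # fraction of pixels kept from x_a
    y_tilde = lam * y_a + (1 - lam) * y_b
    return x_tilde, y_tilde
```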

We first obtain a heatmap (generally a 7×7 grid map) of the first image by passing it through a pretrained classification model like ResNet-50 and take out the final 7×7 output feature map. We then select the top “N” patches from this 7×7 grid as our attentive region patches to cut from the given image. Here N can range from 1 to 49 (i.e. the entire image itself). Later, we will present an ablation study on the number of attentive patches to be cut from a given image.
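The top-N selection step can be sketched as follows. This assumes the 7×7 heatmap has already been collapsed over channels (e.g. by averaging the pretrained network's final feature map, one common choice); the function name is illustrative.

```python
import numpy as np

def top_n_patches(heatmap, n):
    """Return (row, col) grid indices of the n highest-activation
    cells in a 2D attention heatmap (e.g. a 7x7 grid)."""
    order = np.argsort(heatmap.ravel())[::-1]      # cells sorted by descending activation
    return [divmod(int(i), heatmap.shape[1]) for i in order[:n]]
```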

We then map the selected attentive patches back to the original image. For example, a single patch in a 7×7 grid would map back to a 32×32 image patch on a 224×224 size input image. The patches are cut from the first image and pasted onto the second image at their respective original locations, assuming both images are of the same size. The pair of training samples are randomly selected during each training phase. For the composite label, considering that we pick the top 6 attentive patches from a 7×7 grid, $\lambda$ would then be $\frac{6}{49}$. Every image in the training batch is augmented with patches cutout from another randomly selected image in the original batch. Please refer to Fig. 2 for an illustrative representation of our method.

• E.g., for a 224×224 image, each cell of the 7×7 grid maps back to a 32×32 patch
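Putting the cut-and-paste step together, mapping grid cells back to pixel coordinates can be sketched as below. Names are illustrative and both images are assumed to be the same size, as in the paper.

```python
import numpy as np

def paste_attentive_patches(src, dst, patches, grid=7):
    """Cut the given grid cells from src and paste them onto a copy
    of dst at the same locations. Returns the mixed image and the
    label weight lambda for src's label (e.g. 6/49 for 6 patches)."""
    out = dst.copy()
    ph, pw = src.shape[0] // grid, src.shape[1] // grid   # 224 // 7 = 32
    for r, c in patches:
        out[r*ph:(r+1)*ph, c*pw:(c+1)*pw] = src[r*ph:(r+1)*ph, c*pw:(c+1)*pw]
    lam = len(patches) / grid ** 2        # fraction of cells taken from src
    return out, lam
```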

## Experiments and Analysis

### Ablation Study

• With too few patches (fewer than 6), the most important regions may not be fully covered
• With too many, too much of the image is occluded and the model cannot learn effectively
