〖For more Chinese paper walkthroughs, see the [論文速速讀] series introduction for a list of all published articles!〗
論文網址: https://arxiv.org/pdf/1905.04899.pdf
Abstract
…current methods for regional dropout remove informative pixels on training images by overlaying a patch of either black pixels or random noise.
Such removal is not desirable because it leads to information loss and inefficiency during training.
- Previous regional dropout techniques overlay black patches or random noise on the image to make the model more robust
- This causes information loss and reduces training efficiency
We therefore propose the CutMix augmentation strategy: patches are cut and pasted among training images where the ground truth labels are also mixed proportionally to the area of the patches.
- Cut out a patch and fill it with a patch from another training image
Introduction
In particular, to prevent a CNN from focusing too much on a small set of intermediate activations or on a small region on input images, random feature removal regularizations have been proposed. Examples include dropout [33] for randomly dropping hidden activations and regional dropout [2, 49, 32, 7] for erasing random regions on the input. Researchers have shown that the feature removal strategies improve generalization and localization by letting a model attend not only to the most discriminative parts of objects, but rather to the entire object region [32, 7].
To keep a CNN from over-focusing on a few small regions, random feature removal regularizations have been proposed, such as dropout (randomly dropping hidden activations) and regional dropout (erasing random regions of the input).
Research has shown that these feature removal strategies make the model attend to the entire object rather than only its most discriminative part, which improves both generalization and localization.
While regional dropout strategies have shown improvements of classification and localization performances to a certain degree, deleted regions are usually zeroed-out [2, 32] or filled with random noise [49], greatly reducing the proportion of informative pixels on training images
The authors argue, however, that regional dropout strategies fill the dropped region entirely with zeros or noise, which causes excessive information loss.
- In that case, why not fill the dropped region with content from another image? The model stays robust while still having informative pixels to learn from.
Comparison with other methods:
- Cutout simply applies a black patch, which causes unnecessary information loss
- Mixup blends two images by linear interpolation, but the resulting mixed image looks very unnatural
- CutMix fills only the patch region with another image, fixing both drawbacks above
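As a rough illustration (not taken from the paper's code), the three strategies can be contrasted on toy NumPy arrays; the image size and patch coordinates here are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
imgA = rng.random((32, 32, 3))  # two toy "training images"
imgB = rng.random((32, 32, 3))
lam = 0.7                       # combination ratio

# Cutout: zero out a patch -> those pixels carry no information
cutout = imgA.copy()
cutout[8:20, 8:20, :] = 0.0

# Mixup: blend the whole images -> every pixel is an unnatural mixture
mixup = lam * imgA + (1 - lam) * imgB

# CutMix: paste the same patch region from imgB into imgA
cutmix = imgA.copy()
cutmix[8:20, 8:20, :] = imgB[8:20, 8:20, :]
```

Unlike Cutout, every pixel of the CutMix result still comes from a real image, and unlike Mixup, each pixel belongs clearly to one source image.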
Related Works
Previous CNN optimization techniques include:
- regional dropout
- data augmentation

The rest of the section describes how good CutMix is; because it operates purely at the data level, it does not change the original model architecture.
CutMix
Algorithm
Let $x\in \mathbb{R}^{W\times H\times C}$ be an image and $y$ its label. CutMix combines two training samples $(x_A, y_A)$ and $(x_B, y_B)$ into a new sample $(\tilde{x}, \tilde{y})$ with the following formulas:

$\tilde{x} = M \odot x_A + (\mathbf{1} - M) \odot x_B$

$\tilde{y} = \lambda y_A + (1 - \lambda) y_B$

- $M\in \{0, 1\}^{W\times H}$ is a binary mask whose value is 0 inside the bounding box and 1 elsewhere
- $\odot$ is element-wise multiplication
- the combination ratio $\lambda$ is sampled from the beta distribution $\mathrm{Beta}(\alpha, \alpha)$
- sampling $\lambda$ this way follows the original Mixup paper
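The formulas above can be sketched in NumPy; the image size, Beta parameter $\alpha$, box coordinates, and one-hot labels below are assumed values for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, alpha = 32, 32, 1.0  # assumed image size and Beta parameter

# combination ratio, sampled as in Mixup
lam = rng.beta(alpha, alpha)

# binary mask M: 0 inside the bounding box, 1 elsewhere
# (the box is hard-coded here; the paper derives its size from lam)
M = np.ones((H, W))
M[8:20, 8:20] = 0.0

xA = rng.random((H, W, 3))
xB = rng.random((H, W, 3))

# x_tilde = M ⊙ x_A + (1 - M) ⊙ x_B
x_tilde = M[..., None] * xA + (1 - M[..., None]) * xB

# y_tilde = lam * y_A + (1 - lam) * y_B, taking lam as the
# surviving-area ratio of x_A (one-hot labels for illustration)
lam = M.mean()
yA, yB = np.array([1.0, 0.0]), np.array([0.0, 1.0])
y_tilde = lam * yA + (1 - lam) * yB
```

Pixels where $M = 0$ come entirely from $x_B$, pixels where $M = 1$ come from $x_A$, and the label is mixed with the same area ratio.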
Next, define the bounding box $B = (r_x, r_y, r_w, r_h)$:
- the region inside B is cropped from $x_B$ and pasted onto $x_A$

The paper's experiments use rectangular masks with the same aspect ratio as the original image; the box coordinates are sampled from a uniform distribution:

$r_x \sim \mathrm{Unif}(0, W),\quad r_w = W\sqrt{1-\lambda}$

$r_y \sim \mathrm{Unif}(0, H),\quad r_h = H\sqrt{1-\lambda}$

so that the cropped area ratio is $\frac{r_w r_h}{WH} = 1-\lambda$.
The implementation is simple: take a minibatch, shuffle it to obtain a second minibatch, and then generate the new labels with the formula above:
source code: clovaai/CutMix-PyTorch
Determining the bounding box:
```python
def rand_bbox(size, lam):
    W = size[2]
    H = size[3]
    cut_rat = np.sqrt(1. - lam)
    cut_w = int(W * cut_rat)
    cut_h = int(H * cut_rat)

    # uniformly sample the box center
    cx = np.random.randint(W)
    cy = np.random.randint(H)

    # clip the box to the image boundaries
    bbx1 = np.clip(cx - cut_w // 2, 0, W)
    bby1 = np.clip(cy - cut_h // 2, 0, H)
    bbx2 = np.clip(cx + cut_w // 2, 0, W)
    bby2 = np.clip(cy + cut_h // 2, 0, H)

    return bbx1, bby1, bbx2, bby2
```

Take a minibatch; every image in that batch is synthesized with the same bounding box B.
- Note that "mixing the labels" of the synthesized image means, in implementation, computing the cross-entropy loss against each of the two labels and taking their weighted average.
```python
for i, (input, target) in enumerate(train_loader):
```
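A minimal, self-contained sketch of that training step, following the logic of the repo's train.py; the function name `cutmix_step` and the hyperparameters `beta` and `cutmix_prob` are placeholders for this illustration:

```python
import numpy as np
import torch

def rand_bbox(size, lam):
    """Sample a bounding box whose area ratio is 1 - lam."""
    W, H = size[2], size[3]
    cut_rat = np.sqrt(1. - lam)
    cut_w, cut_h = int(W * cut_rat), int(H * cut_rat)
    cx, cy = np.random.randint(W), np.random.randint(H)
    bbx1 = np.clip(cx - cut_w // 2, 0, W)
    bby1 = np.clip(cy - cut_h // 2, 0, H)
    bbx2 = np.clip(cx + cut_w // 2, 0, W)
    bby2 = np.clip(cy + cut_h // 2, 0, H)
    return bbx1, bby1, bbx2, bby2

def cutmix_step(model, criterion, input, target, beta=1.0, cutmix_prob=0.5):
    """One forward pass with CutMix applied to the whole batch."""
    if beta > 0 and np.random.rand() < cutmix_prob:
        lam = np.random.beta(beta, beta)
        # shuffling the batch gives the "second" minibatch
        rand_index = torch.randperm(input.size(0))
        target_a, target_b = target, target[rand_index]
        bbx1, bby1, bbx2, bby2 = rand_bbox(input.size(), lam)
        # paste the same box from the shuffled images into the batch
        input[:, :, bbx1:bbx2, bby1:bby2] = input[rand_index, :, bbx1:bbx2, bby1:bby2]
        # adjust lam to the exact pixel ratio after clipping
        lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (input.size(-1) * input.size(-2)))
        output = model(input)
        # "mixed label" = weighted average of the two cross-entropy losses
        loss = criterion(output, target_a) * lam + criterion(output, target_b) * (1 - lam)
    else:
        output = model(input)
        loss = criterion(output, target)
    return loss
```

With probability `cutmix_prob` the batch is trained on synthesized images; otherwise it falls back to a plain forward pass.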
Discussion
- Mixup blends the two full images, so the model cannot tell which regions it should attend to
- Cutout works fairly well when an image contains only two classes: masking one class out forces the model to discriminate using the remaining one, but the model then cannot attend effectively to the masked-out class
- CutMix is great: it solves both of the problems above