[論文速速讀] End-to-End Object Detection with Transformers

Posted by John on 2020-06-04
Words 2.8k and Reading Time 12 Minutes

〖For more Chinese paper walkthroughs, see the [論文速速讀] series introduction for all published articles!〗

DETR is short for DEtection TRansformer: Facebook's first application of the NLP transformer to object detection in CV, with no NMS required.
FB is seriously impressive. Nothing can stop attention anymore…

論文網址: https://arxiv.org/pdf/2005.12872.pdf

Abstract

We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task.

Object detection is viewed as a direct set prediction problem, which streamlines away many of the extra components of a detection pipeline (non-maximum suppression, anchor generation).

The main ingredients of the new framework, called DEtection TRansformer or
DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.

This passage points out the two key ingredients of the framework:

  1. a set-based global loss that forces unique predictions via bipartite matching, i.e., a global loss built on bipartite matching
  2. the transformer encoder-decoder architecture

Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset.

Through the transformer, DETR can reason about the relations between objects and the global image context and directly output the final set of predictions. It needs no extra post-processing (this matters: object detection traditionally involves a lot of fiddly extra steps, and DETR skips all of them!!). Their results are on par with the well-established, highly optimized Faster RCNN baseline. Granted, Faster RCNN is no longer the newest method and plenty of approaches have since surpassed it, but imagine how exciting it is to see a transformer brought into object detection like this for the first time!

FB has also released their code on GitHub: https://github.com/facebookresearch/detr

Introduction

We streamline the training pipeline by viewing object detection as a direct set prediction problem. We adopt an encoder-decoder architecture based on transformers, a popular architecture for sequence prediction. The self-attention mechanisms of transformers, which explicitly model all pairwise interactions between elements in a sequence, make these architectures particularly suitable for specific constraints of set prediction such as removing duplicate predictions.

Object detection is treated as a set prediction problem, using the transformer's encoder-decoder architecture.

Because the transformer's self-attention mechanism explicitly models the pairwise interactions between elements, it is a natural fit for the specific constraints of set prediction, such as removing duplicate predictions.

Our DEtection TRansformer (DETR, see Figure 1) predicts all objects at once, and is trained end-to-end with a set loss function which performs bipartite matching between predicted and ground-truth objects.

DETR predicts all objects at once, and it is an end-to-end architecture trained with a set loss function that performs bipartite matching between predictions and ground truth. The architecture figure is below; with some grasp of CNNs and the transformer (if not, see my write-up 論文速速讀: Attention Is All You Need), it is not hard to follow:

  1. The image goes through a CNN, and the learned features are fed into a transformer (how exactly they are fed in is an art in itself, more on that later).
  2. The transformer then predicts a set of objects through its decoder, where each prediction contains a class and a bounding box.
  3. These predictions are matched against the ground truth to compute the loss.

So the crux lies in the bipartite matching loss.

Our matching loss function uniquely assigns a prediction to a ground truth object, and is invariant to a permutation of predicted objects, so we can emit them in parallel

This matching loss function uniquely assigns each prediction to a ground-truth object, and it is invariant to the ordering of the predictions, so they can be emitted in parallel.

Our experiments show that our new model achieves comparable performances. More precisely, DETR demonstrates significantly better performance on large objects, a result likely enabled by the non-local computations of the transformer

They found that DETR performs notably better on large objects, likely thanks to the transformer's non-local computations (in other words, it can attend to information over a large spatial extent).

DETR builds on several lines of prior work:

  • bipartite matching losses for set prediction
  • encoder-decoder architectures based on the transformer
  • parallel decoding
  • object detection methods.

Set Prediction

Most current detectors use postprocessings such as non-maximal suppression to address this issue, but direct set prediction are postprocessing-free. They need global inference schemes that model interactions between all predicted elements to avoid redundancy

A major problem in object detection is how to deal with duplicate predictions. One approach is post-processing with non-maximal suppression (NMS) to delete the duplicates. With set prediction this is unnecessary: the model resolves the issue by reasoning globally about the interactions between all predicted elements.

The usual solution is to design a loss based on the Hungarian algorithm, to find a bipartite matching between ground-truth and prediction. This enforces permutation-invariance, and guarantees that each target element has a unique match. We follow the bipartite matching loss approach.

(The English passage above is the heart of this paper.) The Hungarian algorithm matches predictions to ground-truth objects one-to-one, and the result is permutation-invariant: no matter where a prediction sits in the set, the algorithm will find the ground truth it best matches.
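As a toy illustration (assuming scipy is available; the cost values below are made up), scipy.optimize.linear_sum_assignment solves exactly this kind of assignment problem:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# 3 predictions (rows) vs. 2 ground-truth objects (columns);
# entries are matching costs, lower is better.
cost = np.array([[0.9, 0.1],
                 [0.4, 0.8],
                 [0.2, 0.7]])
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, gt_idx)))  # [(0, 1), (2, 0)]: total cost 0.3
# Prediction 1 is left unmatched and would be treated as "no object".
```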

This part also mentions that set prediction has traditionally been done with RNNs, whereas here a transformer is used instead. Makes sense: the RNN family was always going to be replaced by attention; just look at NLP research in recent years, is there any RNN result that has not been swept aside by attention?

Transformers and Parallel Decoding

This section introduces the Transformer; if you are not familiar with it, see 論文速速讀: Attention Is All You Need.

Object detection

Skipping this one; I find it less important. Read the paper if you are interested.

The DETR model

Two ingredients are essential for direct set predictions in detection:
(1) a set prediction loss that forces unique matching between predicted and ground truth boxes;
(2) an architecture that predicts (in a single pass) a set of objects and models their relation.
We describe our architecture in detail in Figure 2.

The two key ingredients of DETR:

  1. A set prediction loss that scores how well predictions correspond to ground-truth boxes; in plain terms, the Hungarian algorithm described above.
  2. An architecture that predicts a set of objects in a single pass and models their relations; in plain terms, our mighty transformer + CNN.

Object detection set prediction loss

DETR infers a fixed-size set of N predictions, in a single pass through the decoder, where N is set to be significantly larger than the typical number of objects in an image. One of the main difficulties of training is to score predicted objects (class, position, size) with respect to the ground truth

First a fixed $N$ is chosen: the number of objects to predict in a single pass. $N$ is set significantly larger than the typical number of objects in an image, so that the matching between predictions and ground truth works out. What is then needed is a good score function that evaluates how well a matched prediction fits its ground truth (object class, position, and size).

Next, the loss function. First, some notation:

  1. $y$ is the set of ground-truth objects, and $\hat{y} = \{ \hat{y}_i \}_{i=1}^{N}$ is the set of predictions.
  2. Since $N$ is assumed larger than the number of objects in the image, the ground truth must be padded with $\varnothing$ (meaning "no object") up to $N$ elements before matching.
  3. Each ground-truth object $y_{i}=\left(c_{i}, b_{i}\right)$ consists of a class label $c_i$ and a bounding box $b_i$ with four values (center coordinates, width, height relative to image size).
  4. The prediction matched to a given $y_i$ is $\hat{y}_{\hat{\sigma}(i)}$.
    • How do we find which $\hat{y}_{\hat{\sigma}(i)}$ each $y_i$ corresponds to? With the matching algorithm mentioned earlier, the Hungarian algorithm.
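A small sketch of the padding idea (a minimal example; `no_object`, the class indices, and the matched indices are illustrative assumptions, not values from the repo):

```python
import torch

# Each of the N prediction slots gets either a matched ground-truth class
# or the "no object" class ∅, here encoded as the last class index.
N, no_object = 100, 91
gt_classes = torch.tensor([17, 17, 63])   # 3 real objects in this image
pred_idx = torch.tensor([5, 42, 97])      # predictions matched to them by the Hungarian algorithm
target = torch.full((N,), no_object)      # every slot starts as ∅
target[pred_idx] = gt_classes             # matched slots receive the real labels
```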

The loss function can then be written as:
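$$\mathcal{L}_{\text{Hungarian}}(y, \hat{y})=\sum_{i=1}^{N}\left[-\log \hat{p}_{\hat{\sigma}(i)}\left(c_{i}\right)+\mathbb{1}_{\left\{c_{i} \neq \varnothing\right\}} \mathcal{L}_{\text{box}}\left(b_{i}, \hat{b}_{\hat{\sigma}(i)}\right)\right]$$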

  • $-\log \hat{p}_{\hat{\sigma}(i)}\left(c_{i}\right)$ pushes the classification to be as accurate as possible
    • $\hat{p}_{\hat{\sigma}(i)}\left(c_{i}\right)$ is the probability that prediction $\hat{\sigma}(i)$ is classified as class $c_i$
    • when the classification is correct (i.e., probability = 1), this term is 0
  • $\mathcal{L}_{\text{box}}$ measures how well the predicted bounding box overlaps the ground-truth box
    • it combines an L1 loss and an IoU loss (L1 alone would make the loss depend too much on the bounding box size, hence the added IoU term)

The detailed formula for the Box Loss is as follows:
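$$\mathcal{L}_{\text{box}}\left(b_{i}, \hat{b}_{\sigma(i)}\right)=\lambda_{\text{iou}} \mathcal{L}_{\text{iou}}\left(b_{i}, \hat{b}_{\sigma(i)}\right)+\lambda_{\mathrm{L}1}\left\|b_{i}-\hat{b}_{\sigma(i)}\right\|_{1}$$

where $\mathcal{L}_{\text{iou}}$ is the generalized IoU loss and $\lambda_{\text{iou}}, \lambda_{\mathrm{L}1}$ are hyperparameters.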

These two losses are normalized by the number of objects inside the batch

Finally, both of these losses are normalized by the number of objects inside the batch.
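Putting the matching together in code, here is a minimal per-image sketch (my own simplification: the function name and the use of plain L1 distance as the only box cost are assumptions; the official repo also includes a generalized-IoU term in the matching cost):

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_logits, pred_boxes, gt_classes, gt_boxes):
    """pred_logits: (N, num_classes+1), pred_boxes: (N, 4),
    gt_classes: (M,), gt_boxes: (M, 4), with N >> M."""
    probs = pred_logits.softmax(-1)                     # per-slot class probabilities
    cost_class = -probs[:, gt_classes]                  # (N, M): -p(c_i) for each pair
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M): L1 box distance
    cost = cost_class + cost_bbox                       # combined matching cost
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return pred_idx, gt_idx  # each ground-truth object gets a unique prediction
```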

DETR architecture

The overall architecture:

  1. Backbone: run a CNN first, then reduce the channel dimension of the last feature map with a 1×1 convolution, giving a feature map $z_{0} \in \mathbb{R}^{d \times H \times W}$, which is then flattened to shape $(d, HW)$ to be fed into the transformer.
  2. Transformer encoder: feed the output of step 1 into the transformer encoder for multi-head self-attention.
    • the original transformer only adds positional encoding once at the input; DETR apparently decided once is not enough and adds it at every encoder block, piled on good and full.
  3. Transformer decoder: the decoder takes $N$ learned $d$-dimensional vectors as object queries.
    • the decoder also adds positional encodings at every layer; notably, here the positional encodings are the query vectors themselves.
  4. Prediction feed-forward networks (FFNs): finally, FFNs (feed-forward networks) produce the $N$ predictions. As mentioned earlier, $N$ is larger than the number of objects in the image, so there is a special class called "no object", which you can think of as the image background.
  5. Auxiliary decoding losses: they found that adding auxiliary losses at the decoder yields better results.

Steps 2 and 3 above are the transformer part, best taken together with the architecture figure below (and the sketch that follows).
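To make this concrete, here is a minimal sketch loosely modeled on the simplified demo in the paper's appendix (a sketch, not the official implementation: the class and attribute names are mine, and unlike the real model, positional encodings are only added once at the encoder input rather than inside every attention layer):

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        # ResNet-50 backbone without the average pool and fc head
        self.backbone = nn.Sequential(*list(resnet50().children())[:-2])
        self.conv = nn.Conv2d(2048, hidden_dim, 1)   # 1x1 conv: 2048 -> d channels
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for "no object"
        self.bbox_head = nn.Linear(hidden_dim, 4)    # (cx, cy, w, h), normalized
        self.query_embed = nn.Parameter(torch.rand(num_queries, hidden_dim))
        # learned 2D positional encoding, half for rows and half for columns
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, images):                       # images: (B, 3, H0, W0)
        h = self.conv(self.backbone(images))         # (B, d, H, W) feature map
        B, d, H, W = h.shape
        pos = torch.cat([                            # (H*W, d) positional encoding
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1)
        src = h.flatten(2).permute(2, 0, 1) + pos.unsqueeze(1)  # (H*W, B, d)
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)     # (N, B, d) object queries
        out = self.transformer(src, tgt)                        # (N, B, d)
        return self.class_head(out), self.bbox_head(out).sigmoid()

model = MinimalDETR(num_classes=91)
logits, boxes = model(torch.rand(1, 3, 800, 800))  # logits: (100, 1, 92), boxes: (100, 1, 4)
```

At inference, `logits.softmax(-1)` gives each query's class probabilities, and queries whose top class is "no object" are simply dropped, which is why no NMS is needed.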

Experiments

Phew… after all that, we finally reach the experiments… I am going to speed up now, please fasten your seatbelt…

The experiments are on the COCO 2017 dataset, compared against Faster RCNN. ResNet is used as the CNN backbone, comparing ResNet-50/101 with and without dilation.

Ablations

In Figure 3, we visualize the attention maps of the last encoder layer of a trained model, focusing on a few points in the image. The encoder seems to separate instances already, which likely simplifies object extraction and localization for the decoder.

First, the attention maps of the last encoder layer are visualized. The encoder already seems able to separate the different instances, which the authors argue simplifies the decoder's job of extracting and localizing objects.

Similarly to visualizing encoder attention, we visualize decoder attentions in Fig. 6, coloring attention maps for each predicted object in different colors. We observe that decoder attention is fairly local, meaning that it mostly attends to object extremities such as heads or legs. We hypothesise that after the encoder has separated instances via global attention, the decoder only needs to attend to the extremities to extract the class and object boundaries.

Next, the decoder attention maps: here the attention concentrates on object extremities, the animals' legs, noses, heads… The authors hypothesize that once the encoder has separated the instances via global attention, the decoder only needs to attend to these extremities to extract the class and object boundaries.

Analysis

The figure below visualizes 20 of the N=100 object queries and the bounding boxes they retrieve. Different query vectors have different spatial distributions (some always focus on the bottom-left, some always on the right…); you can think of it as N different people interrogating the model from different angles.

Then an example: the giraffe image below. The point is that no training image contains more than 13 giraffes at once, yet at prediction time DETR still manages to detect all 24 giraffes.

DETR for panoptic segmentation

DETR can be extended to panoptic segmentation by adding a mask head after the decoder… if you want the details, read the paper yourself… I am tired…

Thoughts

DETR brings the transformer, all the rage in NLP, over to the CV field…

Back when the transformer first appeared, every NLP paper got swept through by it; will CV end up going down the same road? xD

Still, the idea of doing self-attention over images is quite interesting. After watching a video walkthrough, I feel the underlying concept fits the notion of attention in CV very well; I may write a separate post to explore it further.
