# [Paper Speed-Read] End-to-end object detection with Transformers

Posted by John on 2020-06-04
〖For more paper walkthroughs in Chinese, see the [Paper Speed-Read] series introduction, which lists every post published so far!〗

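DETR is short for DEtection TRansformer. With it, FB applies the transformer from NLP to object detection in CV for the first time, and it does not even need NMS.
FB is truly amazing. Nothing can stop attention anymore…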

## Abstract

We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task.

The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.

1. A set-based global loss that forces unique predictions via bipartite matching, i.e. a global loss computed through bipartite matching
2. A transformer encoder-decoder architecture

Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset.

FB has also released their code: github

## Introduction

We streamline the training pipeline by viewing object detection as a direct set prediction problem. We adopt an encoder-decoder architecture based on transformers, a popular architecture for sequence prediction. The self-attention mechanisms of transformers, which explicitly model all pairwise interactions between elements in a sequence, make these architectures particularly suitable for specific constraints of set prediction such as removing duplicate predictions.

Our DEtection TRansformer (DETR, see Figure 1) predicts all objects at once, and is trained end-to-end with a set loss function which performs bipartite matching between predicted and ground-truth objects.

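DETR predicts all objects in one shot and is trained end-to-end, using a set loss function that performs bipartite matching between predictions and ground truth. The architecture (Figure 1 in the paper) is not hard to follow if you have some background in CNNs and transformers (if not, see my write-up of the transformer paper, [Paper Speed-Read]: Attention Is All You Need). In short:

1. The image goes through a CNN, and the learned features are fed to the transformer (how they are fed is a topic of its own, covered later),
2. the transformer decoder then predicts a set of objects, where each prediction contains a class and a bounding box,
3. and the predictions are compared against the ground truth to compute the loss.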

Our matching loss function uniquely assigns a prediction to a ground truth object, and is invariant to a permutation of predicted objects, so we can emit them in parallel.

Our experiments show that our new model achieves comparable performances. More precisely, DETR demonstrates significantly better performance on large objects, a result likely enabled by the non-local computations of the transformer.

DETR builds on the following techniques:

• bipartite matching losses for set prediction
• encoder-decoder architectures based on the transformer
• parallel decoding
• object detection methods.

### Set Prediction

Most current detectors use postprocessings such as non-maximal suppression to address this issue, but direct set prediction are postprocessing-free. They need global inference schemes that model interactions between all predicted elements to avoid redundancy.

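A major problem in object detection is how to deal with duplicate predictions. One approach is a non-maximal suppression (NMS) post-processing step that deletes the duplicates. Set prediction does not need this step: it resolves the issue by performing inference over the relations between the predicted elements.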
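
For contrast, here is a minimal sketch of the greedy NMS post-processing that set prediction avoids. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the function names, the 0.5 IoU threshold, and the sample boxes are illustrative choices, not taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already-kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one object plus one separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the surviving boxes
```

The threshold is a hand-tuned prior about how much two distinct objects may overlap, which is exactly the kind of hand-designed component DETR aims to remove.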

The usual solution is to design a loss based on the Hungarian algorithm, to find a bipartite matching between ground-truth and prediction.
This enforces permutation-invariance, and guarantees that each target element has a unique match. We follow the bipartite matching loss approach.

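(The English passage above is the key point of this post.) The Hungarian algorithm matches predictions to ground truth one-to-one, and the result does not depend on position (permutation invariance): wherever a prediction sits in the set, the algorithm still pairs it with the ground-truth object that gives the lowest total matching cost.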
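
A small sketch of what the matching computes and why it is permutation-invariant. To stay dependency-free it brute-forces all permutations instead of running the Hungarian algorithm; both return the same minimum-cost assignment, but brute force is only viable for tiny N. The cost values and the `match` helper are made up for illustration.

```python
from itertools import permutations


def match(cost):
    """Return (best_total, sigma), where sigma[i] is the prediction index
    assigned to ground truth i and cost[i][j] is the cost of that pairing.
    Brute force over all permutations: O(N!), for illustration only."""
    n = len(cost)
    best_total, best_sigma = float("inf"), None
    for sigma in permutations(range(n)):
        total = sum(cost[i][sigma[i]] for i in range(n))
        if total < best_total:
            best_total, best_sigma = total, sigma
    return best_total, best_sigma


# Hypothetical costs: rows are ground-truth objects, columns are predictions.
cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.3],
]
total, sigma = match(cost)

# Shuffle the order of the predictions (columns) and match again.
perm = [2, 0, 1]  # new column k holds old prediction perm[k]
shuffled = [[row[p] for p in perm] for row in cost]
total_s, sigma_s = match(shuffled)

# Map the shuffled assignment back to the original prediction indices.
sigma_restored = tuple(perm[k] for k in sigma_s)
```

After shuffling, the total cost is unchanged and each ground-truth object is still paired with the same underlying prediction. For realistic sizes a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.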

## The DETR model

Two ingredients are essential for direct set predictions in detection:
(1) a set prediction loss that forces unique matching between predicted and ground truth boxes;
(2) an architecture that predicts (in a single pass) a set of objects and models their relation.
We describe our architecture in detail in Figure 2.

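The two key points of DETR:

1. A set prediction loss that scores how well predictions correspond to ground truth boxes; put plainly, this is the Hungarian algorithm mentioned earlier
2. A model architecture that can predict a set of objects; put plainly, this is our mighty transformer plus a CNN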

### Object detection set prediction loss

DETR infers a fixed-size set of N predictions, in a single pass through the decoder, where N is set to be significantly larger than the typical number of objects in an image. One of the main difficulties of training is to score predicted objects (class, position, size) with respect to the ground truth.

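1. $y$ is the ground-truth set of objects, and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ is the set of predictions
2. $N$ is assumed to be larger than the number of objects an image would normally contain, so for the matching step the ground truth must be padded up to $N$ objects; the padding uses $\varnothing$ (meaning no object)
3. $y_i$ is one ground-truth object; each $y_i$ contains a class label and the four bounding-box values: $y_{i}=\left(c_{i}, b_{i}\right)$, where $b_i$ holds the center coordinates, width, and height relative to the image size
4. For a given $y_i$, the prediction matched to it is $\hat{y}_{\sigma(i)}$
• How do we find which $\hat{y}_{\sigma(i)}$ each $y_i$ should map to? With the matching algorithm mentioned earlier, the Hungarian algorithm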
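
For reference, the matching step and the loss built on it can be written as follows. This is reconstructed from the notation of the DETR paper, since the formulas themselves are not shown in this section; $\mathfrak{S}_N$ denotes the set of permutations of $N$ elements and $\hat{\sigma}$ is the optimal assignment.

$$
\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i=1}^{N} \mathcal{L}_{\mathrm{match}}\left(y_i, \hat{y}_{\sigma(i)}\right)
$$

$$
\mathcal{L}_{\mathrm{match}}\left(y_i, \hat{y}_{\sigma(i)}\right) = -1_{\{c_i \neq \varnothing\}}\, \hat{p}_{\sigma(i)}(c_i) + 1_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\left(b_i, \hat{b}_{\sigma(i)}\right)
$$

$$
\mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + 1_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}\left(b_i, \hat{b}_{\hat{\sigma}(i)}\right) \right]
$$

Note that the matching cost uses the raw probability while the final loss uses the log-probability, and that the box term is switched off for the padded $\varnothing$ entries.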

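The Hungarian loss computed over the matched pairs has two terms:

• $-\log \hat{p}_{\hat{\sigma}(i)}\left(c_{i}\right)$ pushes the predicted classification to be as accurate as possible
• $\hat{p}_{\sigma(i)}\left(c_{i}\right)$ is the probability that prediction $\sigma(i)$ is assigned class $c_i$
• When the classification is correct (that is, the probability is 1), this term is 0
• $1_{\left\{c_{i} \neq \varnothing\right\}} \mathcal{L}_{\mathrm{box}}\left(b_{i}, \hat{b}_{\hat{\sigma}(i)}\right)$ measures how well the bounding boxes overlap
• It combines an L1 loss with an IoU loss (the generalized IoU in the paper). With L1 alone the loss depends too heavily on the size of the bounding box, so the scale-invariant IoU term is added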

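The box loss is defined in the paper as a weighted sum of the two terms, where $\lambda_{\mathrm{iou}}$ and $\lambda_{\mathrm{L1}}$ are hyperparameters:

$$
\mathcal{L}_{\mathrm{box}}\left(b_i, \hat{b}_{\sigma(i)}\right) = \lambda_{\mathrm{iou}}\, \mathcal{L}_{\mathrm{iou}}\left(b_i, \hat{b}_{\sigma(i)}\right) + \lambda_{\mathrm{L1}} \left\lVert b_i - \hat{b}_{\sigma(i)} \right\rVert_1
$$

These two losses are normalized by the number of objects inside the batch.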
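
A minimal pure-Python sketch of a box loss of this kind, combining an L1 term with a generalized IoU term for a single pair of boxes in `(cx, cy, w, h)` format. The function names and the default weights are assumptions made for illustration, and batching and normalization are left out.

```python
def cxcywh_to_xyxy(box):
    """Convert (center x, center y, width, height) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def generalized_iou(a, b):
    """GIoU of two (x1, y1, x2, y2) boxes with positive area."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (enclose - union) / enclose


def l1_distance(target, pred):
    """Sum of absolute coordinate differences."""
    return sum(abs(t - p) for t, p in zip(target, pred))


def box_loss(target, pred, lambda_iou=2.0, lambda_l1=5.0):
    """Weighted sum of (1 - GIoU) and L1; the weights are illustrative."""
    giou = generalized_iou(cxcywh_to_xyxy(target), cxcywh_to_xyxy(pred))
    return lambda_iou * (1.0 - giou) + lambda_l1 * l1_distance(target, pred)


# Same relative error (prediction is half the target size) at two scales.
large_t, large_p = (0.5, 0.5, 0.4, 0.4), (0.5, 0.5, 0.2, 0.2)
small_t, small_p = (0.5, 0.5, 0.04, 0.04), (0.5, 0.5, 0.02, 0.02)
l1_large, l1_small = l1_distance(large_t, large_p), l1_distance(small_t, small_p)
giou_large = generalized_iou(cxcywh_to_xyxy(large_t), cxcywh_to_xyxy(large_p))
giou_small = generalized_iou(cxcywh_to_xyxy(small_t), cxcywh_to_xyxy(small_p))
```

In this example the L1 distance shrinks by a factor of ten for the small box while the GIoU is the same at both scales, which is the scale dependence the IoU term is meant to offset.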

### DETR architecture

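1. Backbone: run a CNN first, then reduce the channel dimension of its last feature map with a 1x1 convolution, giving a feature map $z_{0} \in \mathbb{R}^{d \times H \times W}$; this is then flattened to shape $(d, HW)$, ready to be fed to the transformer
2. Transformer encoder: the output of step 1 goes into the transformer encoder for multi-head self-attention
• The original transformer adds positional encoding only once, at the input; DETR decides once is not enough and adds it in every encoder block
3. Transformer decoder: the transformer decoder uses N vectors of dimension d as the object query vectors
• The decoder also gets positional encodings at every layer; notably, the positional encodings here are the query vectors themselves
4. Prediction feed-forward networks (FFNs): finally, FFNs produce the N predictions. As mentioned earlier, N is larger than the number of objects in the image, so there is a special class called no object, which can be thought of as the image background
5. Auxiliary decoding losses: they found that adding auxiliary losses in the decoder gives better results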
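
Two of the steps above are easy to sketch without a deep learning framework: flattening the $(d, H, W)$ feature map into a $(d, HW)$ sequence, and dropping the predictions assigned to the no object class. Nested lists stand in for tensors; the helper names, class list, and probabilities are hypothetical.

```python
NO_OBJECT = "no object"  # hypothetical label for the extra background class


def flatten_spatial(feature_map):
    """Turn a (d, H, W) nested list into (d, H*W), row by row."""
    return [[v for row in channel for v in row] for channel in feature_map]


def keep_objects(predictions, class_names):
    """Drop predictions whose most likely class is NO_OBJECT.

    predictions: list of (class_probs, box) pairs, with class_probs
    aligned to class_names. Returns (label, probability, box) triples.
    """
    kept = []
    for probs, box in predictions:
        best = max(range(len(probs)), key=lambda k: probs[k])
        if class_names[best] != NO_OBJECT:
            kept.append((class_names[best], probs[best], box))
    return kept


# Toy feature map with d = 2 channels, H = 2, W = 3.
z0 = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
seq = flatten_spatial(z0)  # 2 channels, each a sequence of length 6

# N = 4 predictions for an image that contains two objects.
class_names = ["cat", "dog", NO_OBJECT]
predictions = [
    ([0.8, 0.1, 0.1], (0.5, 0.5, 0.2, 0.3)),
    ([0.1, 0.1, 0.8], (0.1, 0.1, 0.1, 0.1)),
    ([0.2, 0.7, 0.1], (0.7, 0.4, 0.3, 0.3)),
    ([0.05, 0.05, 0.9], (0.9, 0.9, 0.1, 0.1)),
]
objects = keep_objects(predictions, class_names)
```

Because the surplus slots are absorbed by the no object class, the output size N can stay fixed regardless of how many objects an image contains.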

## Experiments

### Ablations

In Figure 3, we visualize the attention maps of the last encoder layer of a trained model, focusing on a few points in the image. The encoder seems to separate instances already, which likely simplifies object extraction and localization for the decoder.

Similarly to visualizing encoder attention, we visualize decoder attentions in Fig. 6, coloring attention maps for each predicted object in different colors. We observe that decoder attention is fairly local, meaning that it mostly attends to object extremities such as heads or legs. We hypothesise that after the encoder has separated instances via global attention, the decoder only needs to attend to the extremities to extract the class and object boundaries.