
DETR: End-to-End Object Detection with Transformers (ECCV 2020)

Sun, 11 Jun 2023

Main ideas

How does it work?

Two ingredients are essential for direct set predictions in detection:

Loss

DETR infers a fixed-size set of $N$ predictions ($N$ is set significantly larger than the typical number of objects in an image). The GT set of objects is padded to size $N$ with the background ("no object", $\varnothing$) class.

Each GT element looks like $y_i = (c_i, b_i)$, where $c_i$ is the target class label (which may be $\varnothing$) and $b_i \in [0, 1]^4$ is a vector that defines the ground-truth box center coordinates, height and width relative to the image size.
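
To make the setup concrete, here is a toy padding sketch (not the official DETR code); `N`, `NUM_CLASSES` and the `NO_OBJECT` index are illustrative conventions:

```python
import torch

N = 100                                   # number of predictions (queries)
NUM_CLASSES = 91                          # e.g. COCO; ∅ gets its own index
NO_OBJECT = NUM_CLASSES                   # hypothetical index for the ∅ class

gt_labels = torch.tensor([17, 17, 63])    # an image with 3 real objects
gt_boxes = torch.rand(3, 4)               # (cx, cy, w, h), relative to image size

padded_labels = torch.full((N,), NO_OBJECT)
padded_labels[: len(gt_labels)] = gt_labels   # the other N - 3 slots stay ∅
```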

Let’s define the bounding box loss as a linear combination of the $l_1$ loss (which is not scale-invariant) and the generalized IoU loss (which is scale-invariant); both terms are normalized by the number of objects inside the batch: $$L_{box}(b_i, b_{\sigma(i)}^{pred}) = \lambda_{iou} L_{iou}(b_i, b_{\sigma(i)}^{pred}) + \lambda_{L_1}\Vert{b_i - b_{\sigma(i)}^{pred}}\Vert_1 ,$$ where $L_{iou}(b_i, b_{\sigma(i)}^{pred}) = 1 - \text{GIoU}(b_i, b_{\sigma(i)}^{pred})$.
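
A minimal PyTorch sketch of $L_{box}$ for already-matched pairs, assuming normalized $(c_x, c_y, w, h)$ boxes and torchvision's `generalized_box_iou`; the weights follow the paper's defaults ($\lambda_{iou} = 2$, $\lambda_{L_1} = 5$), and the batch-level normalization is left to the caller:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_convert, generalized_box_iou

def box_loss(gt_boxes, pred_boxes, lambda_iou=2.0, lambda_l1=5.0):
    """Per-pair L_box for already-matched boxes in (cx, cy, w, h) format."""
    l1 = F.l1_loss(pred_boxes, gt_boxes, reduction="none").sum(-1)    # [M]
    giou = torch.diag(generalized_box_iou(                            # [M]
        box_convert(gt_boxes, "cxcywh", "xyxy"),
        box_convert(pred_boxes, "cxcywh", "xyxy"),
    ))
    return lambda_iou * (1.0 - giou) + lambda_l1 * l1                 # [M]
```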

First of all, we want to find a one-to-one matching between the predictions and the GT boxes. The naive solution of sorting each set by coordinates (top to bottom, left to right) is bad, because it can incorrectly assign hypothesis candidates to ground-truth instances when the model produces false positives or false negatives.

The proposed solution is to use the Hungarian algorithm with the following comparison function (aka matching cost): $$L_{match}(y_i, y_{\sigma(i)}^{pred}) = - p_{\sigma(i)}^{pred}(c_i) + L_{box}(b_i, b_{\sigma(i)}^{pred}) \text{ if } c_i \neq \varnothing \text{ else } 0,$$ where $\sigma$ is the current permutation of elements and $p_{\sigma(i)}^{pred}(c_i)$ is the predicted probability of class $c_i$.
The cost matrix is calculated for all possible pairs, and then the Hungarian algorithm is used to find the optimal assignment (the one with the lowest total cost) of predicted boxes to ground-truth boxes: $$\sigma_{optimal} = \underset{\sigma}{\text{argmin}} \sum_{i=1}^{N}L_{match}(y_i, y_{\sigma(i)}^{pred}).$$
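
Here is a sketch of this matching step; the official DETR implementation also relies on `scipy.optimize.linear_sum_assignment`, but the shapes and the helper below are illustrative. Note that the cost uses $-\text{GIoU}$ directly instead of $1 - \text{GIoU}$: the constant offset doesn't change the argmin:

```python
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import box_convert, generalized_box_iou

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: [N, num_classes + 1], pred_boxes: [N, 4] (cx, cy, w, h)
    # gt_labels:   [M],                  gt_boxes:   [M, 4], with M <= N
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, gt_labels]                    # [N, M]
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)    # [N, M]
    cost_giou = -generalized_box_iou(                   # [N, M]
        box_convert(pred_boxes, "cxcywh", "xyxy"),
        box_convert(gt_boxes, "cxcywh", "xyxy"),
    )
    # 5.0 and 2.0 are the paper's lambda_L1 and lambda_iou defaults
    cost = cost_class + 5.0 * cost_l1 + 2.0 * cost_giou
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return pred_idx, gt_idx   # prediction pred_idx[k] matches GT gt_idx[k]
```

With $N = 100$ predictions and $M$ GT objects, the cost matrix is $100 \times M$ and the assignment returns $M$ matched pairs; every unmatched prediction is implicitly assigned to $\varnothing$.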

The second step is to calculate the loss function for the optimal assignment (called the Hungarian loss, a name that can be confusing because it is applied after the Hungarian algorithm itself). It is defined as a linear combination of a negative log-likelihood for class prediction and the box loss: $$L_{hungarian}(y, y^{pred}) = \sum_{i=1}^{N}\left[- \log p_{\sigma_{opt}(i)}^{pred}(c_i) + \mathbb{1}\{c_i \neq \varnothing\}L_{box}(b_i, b_{\sigma_{opt}(i)}^{pred})\right],$$ where $\sigma_{opt}$ is the optimal assignment; the log-probability term is down-weighted by a factor of 10 when $c_i = \varnothing$ to account for class imbalance. Notice that the matching cost between an object and $\varnothing$ doesn’t depend on the prediction, so in that case the cost is a constant.
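
A sketch of the Hungarian loss for a single image, reusing the hypothetical `box_loss` from the $L_{box}$ sketch above and the `(pred_idx, gt_idx)` assignment from `hungarian_match`; the class-weight vector implements the 10x down-weighting of $\varnothing$, and the normalization is simplified relative to the paper:

```python
import torch
import torch.nn.functional as F

NO_OBJECT = 91  # hypothetical ∅ index, same convention as the padding sketch

def hungarian_loss(pred_logits, pred_boxes, gt_labels, gt_boxes,
                   pred_idx, gt_idx):
    N, num_classes_plus_1 = pred_logits.shape
    target_classes = torch.full((N,), NO_OBJECT, dtype=torch.long)
    target_classes[pred_idx] = gt_labels[gt_idx]   # unmatched slots stay ∅
    weight = torch.ones(num_classes_plus_1)
    weight[NO_OBJECT] = 0.1                        # 10x down-weighting of ∅
    loss_ce = F.cross_entropy(pred_logits, target_classes, weight=weight)
    # box loss only for real objects (the ∅ term has no box component)
    loss_box = box_loss(gt_boxes[gt_idx], pred_boxes[pred_idx]).mean()
    return loss_ce + loss_box
```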

Architecture

I like the huggingface overview of the architecture.

DETR’s transformer in detail:

There are some interesting points I want to mention:

Some technical details

About training:

About the model:

Results

It achieves significantly better performance on large objects than Faster R-CNN, likely thanks to the global reasoning performed by self-attention. However, it obtains lower performance on small objects.

Thanks to global scene reasoning, the encoder is important for disentangling objects. Visualising the attention maps of the last encoder layer of a trained model shows that the encoder seems to separate instances already, which likely simplifies object extraction and localization for the decoder.

For the decoder, the attention visualisations show that it is fairly local: it mostly attends to object extremities such as heads or legs. One explanation is that once the encoder has separated instances via global attention, the decoder only needs to attend to the extremities to extract the class and the object boundaries.

DETR can be extended to panoptic segmentation by adding a mask head on top of the decoder outputs; more precisely, a mask head that predicts a binary mask for each of the predicted boxes. To obtain the final panoptic segmentation, they use an argmax over the mask scores at each pixel and assign the corresponding categories to the resulting masks.
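
A tiny sketch of this final step, with purely illustrative shapes:

```python
import torch

mask_logits = torch.randn(100, 200, 300)      # [num_queries, H, W] mask scores
pred_classes = torch.randint(0, 92, (100,))   # predicted category per query

winner = mask_logits.argmax(dim=0)            # [H, W]: which query owns each pixel
panoptic = pred_classes[winner]               # [H, W]: category per pixel
```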