
Focal Loss for Dense Object Detection

Tsung-Yi Lin  Priya Goyal  Ross Girshick  Kaiming He  Piotr Dollár

Facebook AI Research (FAIR)
Abstract

The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.

[Figure 1 plot: the cross entropy $\textrm{CE}(p_{\textrm{t}})=-\log(p_{\textrm{t}})$ and the focal loss $\textrm{FL}(p_{\textrm{t}})=-(1-p_{\textrm{t}})^{\gamma}\log(p_{\textrm{t}})$ as functions of $p_{\textrm{t}}$.]

Figure 1: We propose a novel loss we term the Focal Loss that adds a factor $(1-p_{\textrm{t}})^{\gamma}$ to the standard cross entropy criterion. Setting $\gamma>0$ reduces the relative loss for well-classified examples ($p_{\textrm{t}}>.5$), putting more focus on hard, misclassified examples. As our experiments will demonstrate, the proposed focal loss enables training highly accurate dense object detectors in the presence of vast numbers of easy background examples.
[Figure 2 legend: detector, AP, inference time (ms)]

[A] YOLOv2 [27]       21.6   25
[B] SSD321 [22]       28.0   61
[C] DSSD321 [9]       28.0   85
[D] R-FCN [3]         29.9   85
[E] SSD513 [22]       31.2  125
[F] DSSD513 [9]       33.2  156
[G] FPN FRCN [20]     36.2  172
RetinaNet-50-500      32.5   73
RetinaNet-101-500     34.4   90
RetinaNet-101-800     37.8  198
(Legend marks in the original figure indicate entries that are not plotted and times that are extrapolated.)

Figure 2: Speed (ms) versus accuracy (AP) on COCO test-dev. Enabled by the focal loss, our simple one-stage RetinaNet detector outperforms all previous one-stage and two-stage detectors, including the best reported Faster R-CNN [28] system from [20]. We show variants of RetinaNet with ResNet-50-FPN (blue circles) and ResNet-101-FPN (orange diamonds) at five scales (400-800 pixels). Ignoring the low-accuracy regime (AP<25), RetinaNet forms an upper envelope of all current detectors, and an improved variant (not shown) achieves 40.8 AP. Details are given in §5.

1 Introduction

Current state-of-the-art object detectors are based on a two-stage, proposal-driven mechanism. As popularized in the R-CNN framework [11], the first stage generates a sparse set of candidate object locations and the second stage classifies each candidate location as one of the foreground classes or as background using a convolutional neural network. Through a sequence of advances [10, 28, 20, 14], this two-stage framework consistently achieves top accuracy on the challenging COCO benchmark [21].

Despite the success of two-stage detectors, a natural question to ask is: could a simple one-stage detector achieve similar accuracy? One stage detectors are applied over a regular, dense sampling of object locations, scales, and aspect ratios. Recent work on one-stage detectors, such as YOLO [26, 27] and SSD [22, 9], demonstrates promising results, yielding faster detectors with accuracy within 10-40% relative to state-of-the-art two-stage methods.

This paper pushes the envelope further: we present a one-stage object detector that, for the first time, matches the state-of-the-art COCO AP of more complex two-stage detectors, such as the Feature Pyramid Network (FPN) [20] or Mask R-CNN [14] variants of Faster R-CNN [28]. To achieve this result, we identify class imbalance during training as the main obstacle impeding one-stage detectors from achieving state-of-the-art accuracy and propose a new loss function that eliminates this barrier.

Class imbalance is addressed in R-CNN-like detectors by a two-stage cascade and sampling heuristics. The proposal stage (e.g., Selective Search [35], EdgeBoxes [39], DeepMask [24, 25], RPN [28]) rapidly narrows down the number of candidate object locations to a small number (e.g., 1-2k), filtering out most background samples. In the second classification stage, sampling heuristics, such as a fixed foreground-to-background ratio (1:3), or online hard example mining (OHEM) [31], are performed to maintain a manageable balance between foreground and background.

In contrast, a one-stage detector must process a much larger set of candidate object locations regularly sampled across an image. In practice this often amounts to enumerating $\sim$100k locations that densely cover spatial positions, scales, and aspect ratios. While similar sampling heuristics may also be applied, they are inefficient as the training procedure is still dominated by easily classified background examples. This inefficiency is a classic problem in object detection that is typically addressed via techniques such as bootstrapping [33, 29] or hard example mining [37, 8, 31].

In this paper, we propose a new loss function that acts as a more effective alternative to previous approaches for dealing with class imbalance. The loss function is a dynamically scaled cross entropy loss, where the scaling factor decays to zero as confidence in the correct class increases, see Figure 1. Intuitively, this scaling factor can automatically down-weight the contribution of easy examples during training and rapidly focus the model on hard examples. Experiments show that our proposed Focal Loss enables us to train a high-accuracy, one-stage detector that significantly outperforms the alternatives of training with the sampling heuristics or hard example mining, the previous state-of-the-art techniques for training one-stage detectors. Finally, we note that the exact form of the focal loss is not crucial, and we show other instantiations can achieve similar results.

To demonstrate the effectiveness of the proposed focal loss, we design a simple one-stage object detector called RetinaNet, named for its dense sampling of object locations in an input image. Its design features an efficient in-network feature pyramid and use of anchor boxes. It draws on a variety of recent ideas from [22, 6, 28, 20]. RetinaNet is efficient and accurate; our best model, based on a ResNet-101-FPN backbone, achieves a COCO test-dev AP of 39.1 while running at 5 fps, surpassing the previously best published single-model results from both one and two-stage detectors, see Figure 2.

2 Related Work

Classic Object Detectors:

The sliding-window paradigm, in which a classifier is applied on a dense image grid, has a long and rich history. One of the earliest successes is the classic work of LeCun et al. who applied convolutional neural networks to handwritten digit recognition [19, 36]. Viola and Jones [37] used boosted object detectors for face detection, leading to widespread adoption of such models. The introduction of HOG [4] and integral channel features [5] gave rise to effective methods for pedestrian detection. DPMs [8] helped extend dense detectors to more general object categories and had top results on PASCAL [7] for many years. While the sliding-window approach was the leading detection paradigm in classic computer vision, with the resurgence of deep learning [18], two-stage detectors, described next, quickly came to dominate object detection.

Two-stage Detectors:

The dominant paradigm in modern object detection is based on a two-stage approach. As pioneered in the Selective Search work [35], the first stage generates a sparse set of candidate proposals that should contain all objects while filtering out the majority of negative locations, and the second stage classifies the proposals into foreground classes / background. R-CNN [11] upgraded the second-stage classifier to a convolutional network yielding large gains in accuracy and ushering in the modern era of object detection. R-CNN was improved over the years, both in terms of speed [15, 10] and by using learned object proposals [6, 24, 28]. Region Proposal Networks (RPN) integrated proposal generation with the second-stage classifier into a single convolution network, forming the Faster R-CNN framework [28]. Numerous extensions to this framework have been proposed, e.g. [20, 31, 32, 16, 14].

One-stage Detectors:

OverFeat [30] was one of the first modern one-stage object detector based on deep networks. More recently SSD [22, 9] and YOLO [26, 27] have renewed interest in one-stage methods. These detectors have been tuned for speed but their accuracy trails that of two-stage methods. SSD has a 10-20% lower AP, while YOLO focuses on an even more extreme speed/accuracy trade-off. See Figure 2. Recent work showed that two-stage detectors can be made fast simply by reducing input image resolution and the number of proposals, but one-stage methods trailed in accuracy even with a larger compute budget [17]. In contrast, the aim of this work is to understand if one-stage detectors can match or surpass the accuracy of two-stage detectors while running at similar or faster speeds.

The design of our RetinaNet detector shares many similarities with previous dense detectors, in particular the concept of ‘anchors’ introduced by RPN [28] and use of features pyramids as in SSD [22] and FPN [20]. We emphasize that our simple detector achieves top results not based on innovations in network design but due to our novel loss.

Class Imbalance:

Both classic one-stage object detection methods, like boosted detectors [37, 5] and DPMs [8], and more recent methods, like SSD [22], face a large class imbalance during training. These detectors evaluate $10^4$-$10^5$ candidate locations per image but only a few locations contain objects. This imbalance causes two problems: (1) training is inefficient as most locations are easy negatives that contribute no useful learning signal; (2) en masse, the easy negatives can overwhelm training and lead to degenerate models. A common solution is to perform some form of hard negative mining [33, 37, 8, 31, 22] that samples hard examples during training or more complex sampling/reweighing schemes [2]. In contrast, we show that our proposed focal loss naturally handles the class imbalance faced by a one-stage detector and allows us to efficiently train on all examples without sampling and without easy negatives overwhelming the loss and computed gradients.

Robust Estimation:

There has been much interest in designing robust loss functions (e.g., Huber loss [13]) that reduce the contribution of outliers by down-weighting the loss of examples with large errors (hard examples). In contrast, rather than addressing outliers, our focal loss is designed to address class imbalance by down-weighting inliers (easy examples) such that their contribution to the total loss is small even if their number is large. In other words, the focal loss performs the opposite role of a robust loss: it focuses training on a sparse set of hard examples.

3 Focal Loss

The Focal Loss is designed to address the one-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training (e.g., 1:1000). We introduce the focal loss starting from the cross entropy (CE) loss for binary classification¹:

¹ Extending the focal loss to the multi-class case is straightforward and works well; for simplicity we focus on the binary loss in this work.

$$\textrm{CE}(p,y)=\begin{cases}-\log(p)&\text{if $y=1$}\\ -\log(1-p)&\text{otherwise.}\end{cases}\qquad(1)$$

In the above $y\in\{\pm 1\}$ specifies the ground-truth class and $p\in[0,1]$ is the model's estimated probability for the class with label $y=1$. For notational convenience, we define $p_{\textrm{t}}$:

$$p_{\textrm{t}}=\begin{cases}p&\text{if $y=1$}\\ 1-p&\text{otherwise,}\end{cases}\qquad(2)$$

and rewrite $\textrm{CE}(p,y)=\textrm{CE}(p_{\textrm{t}})=-\log(p_{\textrm{t}})$.

The CE loss can be seen as the blue (top) curve in Figure 1. One notable property of this loss, which can be easily seen in its plot, is that even examples that are easily classified ($p_{\textrm{t}}\gg.5$) incur a loss with non-trivial magnitude. When summed over a large number of easy examples, these small loss values can overwhelm the rare class.
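To make the scale of this effect concrete, the following minimal Python sketch compares the summed CE loss of many easy negatives against a handful of hard positives; the probabilities and counts are illustrative assumptions, apart from the $\sim$100k anchor scale discussed later:

```python
import numpy as np

def ce(p_t):
    """Cross entropy written in terms of p_t (Eq. 1-2)."""
    return -np.log(p_t)

# Illustrative only: the loss of one easy negative (p_t = 0.99) is tiny,
# but summed over ~100k easy anchors it dwarfs the total loss contributed
# by a hundred hard positives (p_t = 0.1).
total_easy = 100_000 * ce(0.99)   # ~1005
total_hard = 100 * ce(0.10)       # ~230
print(total_easy, total_hard)
```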

3.1 Balanced Cross Entropy

A common method for addressing class imbalance is to introduce a weighting factor $\alpha\in[0,1]$ for class $1$ and $1-\alpha$ for class $-1$. In practice $\alpha$ may be set by inverse class frequency or treated as a hyperparameter to set by cross validation. For notational convenience, we define $\alpha_{\textrm{t}}$ analogously to how we defined $p_{\textrm{t}}$. We write the $\alpha$-balanced CE loss as:

$$\textrm{CE}(p_{\textrm{t}})=-\alpha_{\textrm{t}}\log(p_{\textrm{t}}).\qquad(3)$$

This loss is a simple extension to CE that we consider as an experimental baseline for our proposed focal loss.

3.2 Focal Loss Definition

As our experiments will show, the large class imbalance encountered during training of dense detectors overwhelms the cross entropy loss. Easily classified negatives comprise the majority of the loss and dominate the gradient. While $\alpha$ balances the importance of positive/negative examples, it does not differentiate between easy/hard examples. Instead, we propose to reshape the loss function to down-weight easy examples and thus focus training on hard negatives.

More formally, we propose to add a modulating factor $(1-p_{\textrm{t}})^{\gamma}$ to the cross entropy loss, with tunable focusing parameter $\gamma\geq 0$. We define the focal loss as:

$$\textrm{FL}(p_{\textrm{t}})=-(1-p_{\textrm{t}})^{\gamma}\log(p_{\textrm{t}}).\qquad(4)$$

The focal loss is visualized for several values of $\gamma\in[0,5]$ in Figure 1. We note two properties of the focal loss. (1) When an example is misclassified and $p_{\textrm{t}}$ is small, the modulating factor is near $1$ and the loss is unaffected. As $p_{\textrm{t}}\rightarrow 1$, the factor goes to 0 and the loss for well-classified examples is down-weighted. (2) The focusing parameter $\gamma$ smoothly adjusts the rate at which easy examples are down-weighted. When $\gamma=0$, FL is equivalent to CE, and as $\gamma$ is increased the effect of the modulating factor is likewise increased (we found $\gamma=2$ to work best in our experiments).

Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives low loss. For instance, with $\gamma=2$, an example classified with $p_{\textrm{t}}=0.9$ would have $100\times$ lower loss compared with CE and with $p_{\textrm{t}}\approx 0.968$ it would have $1000\times$ lower loss. This in turn increases the importance of correcting misclassified examples (whose loss is scaled down by at most $4\times$ for $p_{\textrm{t}}\leq.5$ and $\gamma=2$).
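The scaling factors quoted above follow directly from the modulating term; a quick numeric check, assuming nothing beyond Eq. 4:

```python
gamma = 2
for p_t in (0.9, 0.968):
    # The FL/CE ratio equals the modulating factor (1 - p_t)^gamma.
    print(p_t, (1 - p_t) ** gamma)   # 0.01 -> 100x lower, ~0.001 -> ~1000x lower
```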

In practice we use an $\alpha$-balanced variant of the focal loss:

$$\textrm{FL}(p_{\textrm{t}})=-\alpha_{\textrm{t}}(1-p_{\textrm{t}})^{\gamma}\log(p_{\textrm{t}}).\qquad(5)$$

We adopt this form in our experiments as it yields slightly improved accuracy over the non-$\alpha$-balanced form. Finally, we note that the implementation of the loss layer combines the sigmoid operation for computing $p$ with the loss computation, resulting in greater numerical stability.
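A minimal PyTorch sketch of Eq. 5 on raw logits, folding the sigmoid into the loss as noted above for numerical stability; the function name and tensor layout are our assumptions, not the released code:

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Per-element alpha-balanced focal loss (Eq. 5) on raw logits.
    `targets` holds 0/1 labels as floats with the same shape as `logits`."""
    p = torch.sigmoid(logits)
    # BCE-with-logits gives -log(p_t) in a numerically stable way.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # Eq. 2
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return alpha_t * (1 - p_t) ** gamma * ce
```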

While in our main experimental results we use the focal loss definition above, its precise form is not crucial. In the appendix we consider other instantiations of the focal loss and demonstrate that these can be equally effective.

3.3 Class Imbalance and Model Initialization

Binary classification models are by default initialized to have equal probability of outputting either $y=-1$ or $1$. Under such an initialization, in the presence of class imbalance, the loss due to the frequent class can dominate total loss and cause instability in early training. To counter this, we introduce the concept of a 'prior' for the value of $p$ estimated by the model for the rare class (foreground) at the start of training. We denote the prior by $\pi$ and set it so that the model's estimated $p$ for examples of the rare class is low, e.g. $0.01$. We note that this is a change in model initialization (see §4.1) and not of the loss function. We found this to improve training stability for both the cross entropy and focal loss in the case of heavy class imbalance.

3.4 Class Imbalance and Two-stage Detectors

Two-stage detectors are often trained with the cross entropy loss without use of $\alpha$-balancing or our proposed loss. Instead, they address class imbalance through two mechanisms: (1) a two-stage cascade and (2) biased minibatch sampling. The first cascade stage is an object proposal mechanism [35, 24, 28] that reduces the nearly infinite set of possible object locations down to one or two thousand. Importantly, the selected proposals are not random, but are likely to correspond to true object locations, which removes the vast majority of easy negatives. When training the second stage, biased sampling is typically used to construct minibatches that contain, for instance, a 1:3 ratio of positive to negative examples. This ratio is like an implicit $\alpha$-balancing factor that is implemented via sampling. Our proposed focal loss is designed to address these mechanisms in a one-stage detection system directly via the loss function.

Figure 3: The one-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) [20] backbone on top of a feedforward ResNet architecture [16] (a) to generate a rich, multi-scale convolutional feature pyramid (b). To this backbone RetinaNet attaches two subnetworks, one for classifying anchor boxes (c) and one for regressing from anchor boxes to ground-truth object boxes (d). The network design is intentionally simple, which enables this work to focus on a novel focal loss function that eliminates the accuracy gap between our one-stage detector and state-of-the-art two-stage detectors like Faster R-CNN with FPN [20] while running at faster speeds.

4 RetinaNet Detector

RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone is responsible for computing a convolutional feature map over an entire input image and is an off-the-shelf convolutional network. The first subnet performs convolutional object classification on the backbone's output; the second subnet performs convolutional bounding box regression. The two subnetworks feature a simple design that we propose specifically for one-stage, dense detection, see Figure 3. While there are many possible choices for the details of these components, most design parameters are not particularly sensitive to exact values as shown in the experiments. We describe each component of RetinaNet next.

Feature Pyramid Network Backbone:

We adopt the Feature Pyramid Network (FPN) from [20] as the backbone network for RetinaNet. In brief, FPN augments a standard convolutional network with a top-down pathway and lateral connections so the network efficiently constructs a rich, multi-scale feature pyramid from a single resolution input image, see Figure 3(a)-(b). Each level of the pyramid can be used for detecting objects at a different scale. FPN improves multi-scale predictions from fully convolutional networks (FCN) [23], as shown by its gains for RPN [28] and DeepMask-style proposals [24], as well at two-stage detectors such as Fast R-CNN [10] or Mask R-CNN [14].

Following [20], we build FPN on top of the ResNet architecture [16]. We construct a pyramid with levels $P_3$ through $P_7$, where $l$ indicates pyramid level ($P_l$ has resolution $2^l$ lower than the input). As in [20] all pyramid levels have $C=256$ channels. Details of the pyramid generally follow [20] with a few modest differences.²

² RetinaNet uses feature pyramid levels $P_3$ to $P_7$, where $P_3$ to $P_5$ are computed from the output of the corresponding ResNet residual stage ($C_3$ through $C_5$) using top-down and lateral connections just as in [20], $P_6$ is obtained via a 3$\times$3 stride-2 conv on $C_5$, and $P_7$ is computed by applying ReLU followed by a 3$\times$3 stride-2 conv on $P_6$. This differs slightly from [20]: (1) we don't use the high-resolution pyramid level $P_2$ for computational reasons, (2) $P_6$ is computed by strided convolution instead of downsampling, and (3) we include $P_7$ to improve large object detection. These minor modifications improve speed while maintaining accuracy.
While many design choices are not crucial, we emphasize that the use of the FPN backbone is; preliminary experiments using features from only the final ResNet layer yielded low AP.

Anchors:

We use translation-invariant anchor boxes similar to those in the RPN variant in [20]. The anchors have areas of $32^2$ to $512^2$ on pyramid levels $P_3$ to $P_7$, respectively. As in [20], at each pyramid level we use anchors at three aspect ratios $\{1{:}2,\ 1{:}1,\ 2{:}1\}$. For denser scale coverage than in [20], at each level we add anchors of sizes $\{2^{0}, 2^{1/3}, 2^{2/3}\}$ of the original set of 3 aspect ratio anchors. This improves AP in our setting. In total there are $A=9$ anchors per level and across levels they cover the scale range 32 - 813 pixels with respect to the network's input image.
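As a reading aid, here is a short sketch of the 9 anchor shapes per level implied by the description above; whether the octave scale multiplies the side length and whether the aspect ratio is width/height are our assumptions:

```python
import numpy as np

def anchor_shapes(level):
    """Return the A = 9 (width, height) pairs used at every position of
    pyramid level `level` (3..7): base size 32 at P3 doubling up to 512
    at P7, with three octave scales and three aspect ratios."""
    base = 4 * 2 ** level                        # 32, 64, 128, 256, 512
    shapes = []
    for octave in (2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)):
        size = base * octave
        for aspect in (0.5, 1.0, 2.0):           # assumed width / height
            shapes.append((size * np.sqrt(aspect), size / np.sqrt(aspect)))
    return shapes
```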

Each anchor is assigned a length $K$ one-hot vector of classification targets, where $K$ is the number of object classes, and a 4-vector of box regression targets. We use the assignment rule from RPN [28] but modified for multi-class detection and with adjusted thresholds. Specifically, anchors are assigned to ground-truth object boxes using an intersection-over-union (IoU) threshold of 0.5; and to background if their IoU is in [0, 0.4). As each anchor is assigned to at most one object box, we set the corresponding entry in its length $K$ label vector to $1$ and all other entries to $0$. If an anchor is unassigned, which may happen with overlap in [0.4, 0.5), it is ignored during training. Box regression targets are computed as the offset between each anchor and its assigned object box, or omitted if there is no assignment.
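A hedged NumPy sketch of the assignment rule just described; the IoU matrix is assumed precomputed and at least one ground-truth box is assumed present:

```python
import numpy as np

def assign_anchors(iou, gt_classes, num_classes):
    """iou: (num_anchors, num_gt) IoU matrix; gt_classes: (num_gt,) int ids.
    Returns one-hot classification targets, an ignore mask for anchors whose
    best IoU falls in [0.4, 0.5), and the matched gt index used for the box
    regression targets of foreground anchors."""
    best_gt = iou.argmax(axis=1)
    best_iou = iou.max(axis=1)

    labels = np.zeros((iou.shape[0], num_classes), dtype=np.float32)
    foreground = best_iou >= 0.5
    labels[foreground, gt_classes[best_gt[foreground]]] = 1.0
    ignore = (best_iou >= 0.4) & ~foreground     # excluded from training
    return labels, ignore, best_gt
```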

Classification Subnet:

The classification subnet predicts the probability of object presence at each spatial position for each of the $A$ anchors and $K$ object classes. This subnet is a small FCN attached to each FPN level; parameters of this subnet are shared across all pyramid levels. Its design is simple. Taking an input feature map with $C$ channels from a given pyramid level, the subnet applies four 3$\times$3 conv layers, each with $C$ filters and each followed by ReLU activations, followed by a 3$\times$3 conv layer with $KA$ filters. Finally sigmoid activations are attached to output the $KA$ binary predictions per spatial location, see Figure 3 (c). We use $C=256$ and $A=9$ in most experiments.
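A sketch of this head in PyTorch; the module and argument names are ours, and the bias initialization anticipates §4.1:

```python
import math
import torch.nn as nn

class ClassificationSubnet(nn.Module):
    """Four 3x3 convs with C filters and ReLU, then a 3x3 conv with K*A
    outputs per spatial position (Figure 3c). Applying the same module to
    every pyramid level shares its parameters across levels."""

    def __init__(self, C=256, K=80, A=9, prior=0.01):
        super().__init__()
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(C, C, 3, padding=1), nn.ReLU(inplace=True)]
        self.tower = nn.Sequential(*layers)
        self.logits = nn.Conv2d(C, K * A, 3, padding=1)
        # Bias init from Section 4.1 so every anchor starts near p = prior.
        nn.init.constant_(self.logits.bias, -math.log((1 - prior) / prior))

    def forward(self, feature):                   # feature: (N, C, H, W)
        return self.logits(self.tower(feature))   # raw logits, (N, K*A, H, W)
```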

In contrast to RPN [28], our object classification subnet is deeper, uses only 3$\times$3 convs, and does not share parameters with the box regression subnet (described next). We found these higher-level design decisions to be more important than specific values of hyperparameters.

Box Regression Subnet:

In parallel with the object classification subnet, we attach another small FCN to each pyramid level for the purpose of regressing the offset from each anchor box to a nearby ground-truth object, if one exists. The design of the box regression subnet is identical to the classification subnet except that it terminates in $4A$ linear outputs per spatial location, see Figure 3 (d). For each of the $A$ anchors per spatial location, these $4$ outputs predict the relative offset between the anchor and the ground-truth box (we use the standard box parameterization from R-CNN [11]). We note that unlike most recent work, we use a class-agnostic bounding box regressor which uses fewer parameters and we found to be equally effective. The object classification subnet and the box regression subnet, though sharing a common structure, use separate parameters.

4.1 Inference and Training

Inference:

RetinaNet forms a single FCN comprised of a ResNet-FPN backbone, a classification subnet, and a box regression subnet, see Figure 3. As such, inference involves simply forwarding an image through the network. To improve speed, we only decode box predictions from at most 1k top-scoring predictions per FPN level, after thresholding detector confidence at 0.05. The top predictions from all levels are merged and non-maximum suppression with a threshold of 0.5 is applied to yield the final detections.
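A sketch of this decoding path, assuming boxes have already been decoded from anchor offsets and using class-agnostic NMS for brevity (per-class handling is omitted):

```python
import torch
from torchvision.ops import nms

def decode_detections(per_level_boxes, per_level_scores,
                      score_thresh=0.05, topk=1000, nms_thresh=0.5):
    """per_level_boxes / per_level_scores: one (N_l, 4) / (N_l,) tensor per
    FPN level. Threshold at 0.05, keep at most the top-1k predictions per
    level, merge levels, then apply NMS at 0.5."""
    boxes, scores = [], []
    for b, s in zip(per_level_boxes, per_level_scores):
        keep = s > score_thresh
        b, s = b[keep], s[keep]
        s, idx = s.topk(min(topk, s.numel()))
        boxes.append(b[idx])
        scores.append(s)
    boxes, scores = torch.cat(boxes), torch.cat(scores)
    keep = nms(boxes, scores, nms_thresh)
    return boxes[keep], scores[keep]
```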

Focal Loss:

We use the focal loss introduced in this work as the loss on the output of the classification subnet. As we will show in §5, we find that $\gamma=2$ works well in practice and the RetinaNet is relatively robust to $\gamma\in[0.5,5]$. We emphasize that when training RetinaNet, the focal loss is applied to all $\sim$100k anchors in each sampled image. This stands in contrast to common practice of using heuristic sampling (RPN) or hard example mining (OHEM, SSD) to select a small set of anchors (e.g., 256) for each minibatch. The total focal loss of an image is computed as the sum of the focal loss over all $\sim$100k anchors, normalized by the number of anchors assigned to a ground-truth box. We perform the normalization by the number of assigned anchors, not total anchors, since the vast majority of anchors are easy negatives and receive negligible loss values under the focal loss. Finally we note that $\alpha$, the weight assigned to the rare class, also has a stable range, but it interacts with $\gamma$ making it necessary to select the two together (see Tables 1a and 1b). In general $\alpha$ should be decreased slightly as $\gamma$ is increased (for $\gamma=2$, $\alpha=0.25$ works best).
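In code, this normalization amounts to something like the following sketch, which assumes the per-anchor focal loss has already been computed:

```python
def normalized_focal_loss(per_anchor_fl, num_foreground):
    """per_anchor_fl: focal loss over all ~100k anchors of one image;
    num_foreground: number of anchors assigned to a ground-truth box.
    Normalize by the foreground count, not the total anchor count."""
    return per_anchor_fl.sum() / max(num_foreground, 1)
```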

(a) Varying $\alpha$ for CE loss ($\gamma=0$)

$\alpha$   AP    AP50  AP75
.10        0.0   0.0   0.0
.25        10.8  16.0  11.7
.50        30.2  46.7  32.8
.75        31.1  49.4  33.0
.90        30.8  49.7  32.3
.99        28.7  47.4  29.9
.999       25.1  41.7  26.1

(b) Varying $\gamma$ for FL (w. optimal $\alpha$)

$\gamma$  $\alpha$  AP    AP50  AP75
0         .75       31.1  49.4  33.0
0.1       .75       31.4  49.9  33.1
0.2       .75       31.9  50.7  33.4
0.5       .50       32.9  51.7  35.2
1.0       .25       33.7  52.0  36.2
2.0       .25       34.0  52.5  36.5
5.0       .25       32.2  49.6  34.8

(c) Varying anchor scales and aspects

#sc  #ar  AP    AP50  AP75
1    1    30.3  49.0  31.8
2    1    31.9  50.0  34.0
3    1    31.8  49.4  33.7
1    3    32.4  52.3  33.9
2    3    34.2  53.1  36.5
3    3    34.0  52.5  36.5
4    3    33.8  52.1  36.2

(d) FL vs. OHEM baselines (with ResNet-101-FPN)

method     batch size  nms thr  AP    AP50  AP75
OHEM       128         .7       31.1  47.2  33.2
OHEM       256         .7       31.8  48.8  33.9
OHEM       512         .7       30.6  47.0  32.6
OHEM       128         .5       32.8  50.3  35.1
OHEM       256         .5       31.0  47.4  33.0
OHEM       512         .5       27.6  42.0  29.2
OHEM 1:3   128         .5       31.1  47.2  33.2
OHEM 1:3   256         .5       28.3  42.4  30.3
OHEM 1:3   512         .5       24.0  35.5  25.8
FL         n/a         n/a      36.0  54.9  38.7

(e) Accuracy/speed trade-off RetinaNet (on test-dev)

depth  scale  AP    AP50  AP75  APS   APM   APL   time
50     400    30.5  47.8  32.7  11.2  33.8  46.1  64
50     500    32.5  50.9  34.8  13.9  35.8  46.7  72
50     600    34.3  53.2  36.9  16.2  37.4  47.4  98
50     700    35.1  54.2  37.7  18.0  39.3  46.4  121
50     800    35.7  55.0  38.5  18.9  38.9  46.3  153
101    400    31.9  49.5  34.1  11.6  35.8  48.5  81
101    500    34.4  53.1  36.8  14.7  38.5  49.1  90
101    600    36.0  55.2  38.7  17.4  39.6  49.7  122
101    700    37.1  56.6  39.8  19.1  40.6  49.4  154
101    800    37.8  57.5  40.8  20.2  41.1  49.2  198

Table 1: Ablation experiments for RetinaNet and Focal Loss (FL). All models are trained on trainval35k and tested on minival unless noted. If not specified, default values are: $\gamma=2$; anchors for 3 scales and 3 aspect ratios; ResNet-50-FPN backbone; and a 600 pixel train and test image scale. (a) RetinaNet with $\alpha$-balanced CE achieves at most 31.1 AP. (b) In contrast, using FL with the same exact network gives a 2.9 AP gain and is fairly robust to exact $\gamma$/$\alpha$ settings. (c) Using 2-3 scale and 3 aspect ratio anchors yields good results after which point performance saturates. (d) FL outperforms the best variants of online hard example mining (OHEM) [31, 22] by over 3 points AP. (e) Accuracy/Speed trade-off of RetinaNet on test-dev for various network depths and image scales (see also Figure 2).

Initialization:

We experiment with ResNet-50-FPN and ResNet-101-FPN backbones [20]. The base ResNet-50 and ResNet-101 models are pre-trained on ImageNet1k; we use the models released by [16]. New layers added for FPN are initialized as in [20]. All new conv layers except the final one in the RetinaNet subnets are initialized with bias $b=0$ and a Gaussian weight fill with $\sigma=0.01$. For the final conv layer of the classification subnet, we set the bias initialization to $b=-\log((1-\pi)/\pi)$, where $\pi$ specifies that at the start of training every anchor should be labeled as foreground with confidence of $\sim\pi$. We use $\pi=.01$ in all experiments, although results are robust to the exact value. As explained in §3.3, this initialization prevents the large number of background anchors from generating a large, destabilizing loss value in the first iteration of training.
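A one-line check of this bias initialization with $\pi=.01$, showing that the classification subnet indeed starts out predicting roughly the prior:

```python
import math

pi = 0.01
b = -math.log((1 - pi) / pi)       # bias for the final classification conv
p = 1 / (1 + math.exp(-b))         # sigmoid(b)
print(b, p)                        # b ~ -4.595, p = 0.01 = pi
```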

Optimization:

RetinaNet is trained with stochastic gradient descent (SGD). We use synchronized SGD over 8 GPUs with a total of 16 images per minibatch (2 images per GPU). Unless otherwise specified, all models are trained for 90k iterations with an initial learning rate of 0.01, which is then divided by 10 at 60k and again at 80k iterations. We use horizontal image flipping as the only form of data augmentation unless otherwise noted. Weight decay of 0.0001 and momentum of 0.9 are used. The training loss is the sum of the focal loss and the standard smooth $L_1$ loss used for box regression [10]. Training time ranges between 10 and 35 hours for the models in Table 1e.
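A hedged sketch of this schedule with PyTorch's optimizer and learning-rate scheduler; the model interface and data loader are assumptions, the per-image normalization of the focal loss is omitted for brevity, and sigmoid_focal_loss refers to the sketch in §3.2:

```python
import torch
import torch.nn.functional as F

def train(model, data_loader, num_steps=90_000):
    """Assumes `model(batch)` returns (cls_logits, cls_targets,
    box_preds, box_targets) for one 16-image minibatch."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=0.0001)
    # Learning rate divided by 10 at 60k and again at 80k iterations.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[60_000, 80_000], gamma=0.1)
    for _, batch in zip(range(num_steps), data_loader):
        cls_logits, cls_targets, box_preds, box_targets = model(batch)
        loss = (sigmoid_focal_loss(cls_logits, cls_targets).sum()
                + F.smooth_l1_loss(box_preds, box_targets, reduction="sum"))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
```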

5 Experiments

We present experimental results on the bounding box detection track of the challenging COCO benchmark [21]. For training, we follow common practice [1, 20] and use the COCO trainval35k split (union of 80k images from train and a random 35k subset of images from the 40k image val split). We report lesion and sensitivity studies by evaluating on the minival split (the remaining 5k images from val). For our main results, we report COCO AP on the test-dev split, which has no public labels and requires use of the evaluation server.

Figure 4: Cumulative distribution functions of the normalized loss for positive and negative samples for different values of $\gamma$ for a converged model. The effect of changing $\gamma$ on the distribution of the loss for positive examples is minor. For negatives, however, increasing $\gamma$ heavily concentrates the loss on hard examples, focusing nearly all attention away from easy negatives.

5.1 Training Dense Detection

We run numerous experiments to analyze the behavior of the loss function for dense detection along with various optimization strategies. For all experiments we use depth 50 or 101 ResNets [16] with a Feature Pyramid Network (FPN) [20] constructed on top. For all ablation studies we use an image scale of 600 pixels for training and testing.

Network Initialization:

Our first attempt to train RetinaNet uses standard cross entropy (CE) loss without any modifications to the initialization or learning strategy. This fails quickly, with the network diverging during training. However, simply initializing the last layer of our model such that the prior probability of detecting an object is $\pi=.01$ (see §4.1) enables effective learning. Training RetinaNet with ResNet-50 and this initialization already yields a respectable AP of 30.2 on COCO. Results are insensitive to the exact value of $\pi$ so we use $\pi=.01$ for all experiments.

Balanced Cross Entropy:

Our next attempt to improve learning involved using the $\alpha$-balanced CE loss described in §3.1. Results for various $\alpha$ are shown in Table 1a. Setting $\alpha=.75$ gives a gain of 0.9 points AP.

Focal Loss:

Results using our proposed focal loss are shown in Table 1b. The focal loss introduces one new hyperparameter, the focusing parameter $\gamma$, that controls the strength of the modulating term. When $\gamma=0$, our loss is equivalent to the CE loss. As $\gamma$ increases, the shape of the loss changes so that “easy” examples with low loss get further discounted, see Figure 1. FL shows large gains over CE as $\gamma$ is increased. With $\gamma=2$, FL yields a 2.9 AP improvement over the $\alpha$-balanced CE loss.

For the experiments in Table 1b, for a fair comparison we find the best $\alpha$ for each $\gamma$. We observe that lower $\alpha$'s are selected for higher $\gamma$'s (as easy negatives are down-weighted, less emphasis needs to be placed on the positives). Overall, however, the benefit of changing $\gamma$ is much larger, and indeed the best $\alpha$'s ranged in just [.25, .75] (we tested $\alpha\in[.01,.999]$). We use $\gamma=2.0$ with $\alpha=.25$ for all experiments but $\alpha=.5$ works nearly as well (.4 AP lower).

Analysis of the Focal Loss:

To understand the focal loss better, we analyze the empirical distribution of the loss of a converged model. For this, we take our default ResNet-101 600-pixel model trained with $\gamma=2$ (which has 36.0 AP). We apply this model to a large number of random images and sample the predicted probability for $\sim$$10^7$ negative windows and $\sim$$10^5$ positive windows. Next, separately for positives and negatives, we compute FL for these samples, and normalize the loss such that it sums to one. Given the normalized loss, we can sort the loss from lowest to highest and plot its cumulative distribution function (CDF) for both positive and negative samples and for different settings for $\gamma$ (even though the model was trained with $\gamma=2$).

Cumulative distribution functions for positive and negative samples are shown in Figure 4. If we observe the positive samples, we see that the CDF looks fairly similar for different values of $\gamma$. For example, approximately 20% of the hardest positive samples account for roughly half of the positive loss; as $\gamma$ increases more of the loss gets concentrated in the top 20% of examples, but the effect is minor.

The effect of $\gamma$ on negative samples is dramatically different. For $\gamma=0$, the positive and negative CDFs are quite similar. However, as $\gamma$ increases, substantially more weight becomes concentrated on the hard negative examples. In fact, with $\gamma=2$ (our default setting), the vast majority of the loss comes from a small fraction of samples. As can be seen, FL can effectively discount the effect of easy negatives, focusing all attention on the hard negative examples.

                              backbone                  AP    AP50  AP75  APS   APM   APL
Two-stage methods
 Faster R-CNN+++ [16]         ResNet-101-C4             34.9  55.7  37.4  15.6  38.7  50.9
 Faster R-CNN w FPN [20]      ResNet-101-FPN            36.2  59.1  39.0  18.2  39.0  48.2
 Faster R-CNN by G-RMI [17]   Inception-ResNet-v2 [34]  34.7  55.5  36.7  13.5  38.1  52.0
 Faster R-CNN w TDM [32]      Inception-ResNet-v2-TDM   36.8  57.7  39.2  16.2  39.8  52.1
One-stage methods
 YOLOv2 [27]                  DarkNet-19 [27]           21.6  44.0  19.2  5.0   22.4  35.5
 SSD513 [22, 9]               ResNet-101-SSD            31.2  50.4  33.3  10.2  34.5  49.8
 DSSD513 [9]                  ResNet-101-DSSD           33.2  53.3  35.2  13.0  35.4  51.1
 RetinaNet (ours)             ResNet-101-FPN            39.1  59.1  42.3  21.8  42.7  50.2
 RetinaNet (ours)             ResNeXt-101-FPN           40.8  61.1  44.1  24.1  44.2  51.2

Table 2: Object detection single-model results (bounding box AP), vs. state-of-the-art on COCO test-dev. We show results for our RetinaNet-101-800 model, trained with scale jitter and for 1.5$\times$ longer than the same model from Table 1e. Our model achieves top results, outperforming both one-stage and two-stage models. For a detailed breakdown of speed versus accuracy see Table 1e and Figure 2.

Online Hard Example Mining (OHEM):

[31] proposed to improve training of two-stage detectors by constructing minibatches using high-loss examples. Specifically, in OHEM each example is scored by its loss, non-maximum suppression (nms) is then applied, and a minibatch is constructed with the highest-loss examples. The nms threshold and batch size are tunable parameters. Like the focal loss, OHEM puts more emphasis on misclassified examples, but unlike FL, OHEM completely discards easy examples. We also implement a variant of OHEM used in SSD [22]: after applying nms to all examples, the minibatch is constructed to enforce a 1:3 ratio between positives and negatives to help ensure each minibatch has enough positives.
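A simplified sketch of the OHEM selection step described above; loss values stand in for scores in the NMS call, and the 1:3 variant's ratio enforcement is omitted:

```python
import torch
from torchvision.ops import nms

def ohem_select(boxes, per_example_loss, batch_size=128, nms_thresh=0.7):
    """Score each example by its current loss, suppress near-duplicates with
    NMS, and keep the `batch_size` highest-loss survivors for the update.
    torchvision's nms returns indices sorted by decreasing score."""
    keep = nms(boxes, per_example_loss, nms_thresh)
    return keep[:batch_size]
```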

We test both OHEM variants in our setting of one-stage detection which has large class imbalance. Results for the original OHEM strategy and the ‘OHEM 1:3’ strategy for selected batch sizes and nms thresholds are shown in Table 1d. These results use ResNet-101, our baseline trained with FL achieves 36.0 AP for this setting. In contrast, the best setting for OHEM (no 1:3 ratio, batch size 128, nms of .5) achieves 32.8 AP. This is a gap of 3.2 AP, showing FL is more effective than OHEM for training dense detectors. We note that we tried other parameter setting and variants for OHEM but did not achieve better results.

Hinge Loss:

Finally, in early experiments, we attempted to train with the hinge loss [13] on $p_{\textrm{t}}$, which sets loss to 0 above a certain value of $p_{\textrm{t}}$. However, this was unstable and we did not manage to obtain meaningful results. Results exploring alternate loss functions are in the appendix.

5.2 Model Architecture Design

Anchor Density:

One of the most important design factors in a one-stage detection system is how densely it covers the space of possible image boxes. Two-stage detectors can classify boxes at any position, scale, and aspect ratio using a region pooling operation [10]. In contrast, as one-stage detectors use a fixed sampling grid, a popular approach for achieving high coverage of boxes in these approaches is to use multiple ‘anchors’ [28] at each spatial position to cover boxes of various scales and aspect ratios.

We sweep over the number of scale and aspect ratio anchors used at each spatial position and each pyramid level in FPN. We consider cases from a single square anchor at each location to 12 anchors per location spanning 4 sub-octave scales ($2^{k/4}$, for $k\leq 3$) and 3 aspect ratios [0.5, 1, 2]. Results using ResNet-50 are shown in Table 1c. A surprisingly good AP (30.3) is achieved using just one square anchor. However, the AP can be improved by nearly 4 points (to 34.0) when using 3 scales and 3 aspect ratios per location. We used this setting for all other experiments in this work.

Finally, we note that increasing beyond 6-9 anchors did not show further gains. Thus while two-stage systems can classify arbitrary boxes in an image, the saturation of performance w.r.t. density implies the higher potential density of two-stage systems may not offer an advantage.

Speed versus Accuracy:

Larger backbone networks yield higher accuracy, but also slower inference speeds. Likewise for input image scale (defined by the shorter image side). We show the impact of these two factors in Table 1e. In Figure 2 we plot the speed/accuracy trade-off curve for RetinaNet and compare it to recent methods using public numbers on COCO test-dev. The plot reveals that RetinaNet, enabled by our focal loss, forms an upper envelope over all existing methods, discounting the low-accuracy regime. RetinaNet with ResNet-101-FPN and a 600 pixel image scale (which we denote by RetinaNet-101-600 for simplicity) matches the accuracy of the recently published ResNet-101-FPN Faster R-CNN [20], while running in 122 ms per image compared to 172 ms (both measured on an Nvidia M40 GPU). Using larger scales allows RetinaNet to surpass the accuracy of all two-stage approaches, while still being faster. For faster runtimes, there is only one operating point (500 pixel input) at which using ResNet-50-FPN improves over ResNet-101-FPN. Addressing the high frame rate regime will likely require special network design, as in [27], and is beyond the scope of this work. We note that after publication, faster and more accurate results can now be obtained by a variant of Faster R-CNN from [12].

5.3 Comparison to State of the Art

We evaluate RetinaNet on the challenging COCO dataset and compare test-dev results to recent state-of-the-art methods including both one-stage and two-stage models. Results are presented in Table 2 for our RetinaNet-101-800 model trained using scale jitter and for 1.5$\times$ longer than the models in Table 1e (giving a 1.3 AP gain). Compared to existing one-stage methods, our approach achieves a healthy 5.9 point AP gap (39.1 vs. 33.2) with the closest competitor, DSSD [9], while also being faster, see Figure 2. Compared to recent two-stage methods, RetinaNet achieves a 2.3 point gap above the top-performing Faster R-CNN model based on Inception-ResNet-v2-TDM [32]. Plugging in ResNeXt-32x8d-101-FPN [38] as the RetinaNet backbone further improves results another 1.7 AP, surpassing 40 AP on COCO.

6 Conclusion

In this work, we identify class imbalance as the primary obstacle preventing one-stage object detectors from surpassing top-performing, two-stage methods. To address this, we propose the focal loss which applies a modulating term to the cross entropy loss in order to focus learning on hard negative examples. Our approach is simple and highly effective. We demonstrate its efficacy by designing a fully convolutional one-stage detector and report extensive experimental analysis showing that it achieves state-of-the-art accuracy and speed. Source code is available at https://github.com/facebookresearch/Detectron [12].

Appendix A: Focal Loss*

Figure 5: Focal loss variants compared to the cross entropy as a function of $x_{\textrm{t}}=yx$. Both the original FL and the alternate variant $\textrm{FL}^{*}$ reduce the relative loss for well-classified examples ($x_{\textrm{t}}>0$).
loss   γ     β     AP    AP50  AP75
CE     –     –     31.1  49.4  33.0
FL     2.0   –     34.0  52.5  36.5
FL*    2.0   1.0   33.8  52.7  36.3
FL*    4.0   0.0   33.9  51.8  36.4
Table 3: Results of FL and $\textrm{FL}^{*}$ versus CE for select settings.

The exact form of the focal loss is not crucial. We now show an alternate instantiation of the focal loss that has similar properties and yields comparable results. The following also gives more insights into properties of the focal loss.

We begin by considering both cross entropy (CE) and the focal loss (FL) in a slightly different form than in the main text. Specifically, we define a quantity $x_{\textrm{t}}$ as follows:

$$x_{\textrm{t}} = yx, \qquad (6)$$

where $y\in\{\pm 1\}$ specifies the ground-truth class as before. We can then write $p_{\textrm{t}}=\sigma(x_{\textrm{t}})$ (this is compatible with the definition of $p_{\textrm{t}}$ in Equation 2). An example is correctly classified when $x_{\textrm{t}}>0$, in which case $p_{\textrm{t}}>.5$.

We can now define an alternate form of the focal loss in terms of $x_{\textrm{t}}$. We define $p_{\textrm{t}}^{*}$ and $\textrm{FL}^{*}$ as follows:

$$p_{\textrm{t}}^{*} = \sigma(\gamma x_{\textrm{t}} + \beta), \qquad (7)$$
$$\textrm{FL}^{*} = -\log(p_{\textrm{t}}^{*})/\gamma. \qquad (8)$$

$\textrm{FL}^{*}$ has two parameters, $\gamma$ and $\beta$, that control the steepness and shift of the loss curve. We plot $\textrm{FL}^{*}$ for two selected settings of $\gamma$ and $\beta$ in Figure 5 alongside CE and FL. As can be seen, like FL, $\textrm{FL}^{*}$ with the selected parameters diminishes the loss assigned to well-classified examples.

We trained RetinaNet-50-600 using identical settings as before but we swap out FL for $\textrm{FL}^{*}$ with the selected parameters. These models achieve nearly the same AP as those trained with FL, see Table 3. In other words, $\textrm{FL}^{*}$ is a reasonable alternative for the FL that works well in practice.
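For concreteness, a minimal sketch of both losses follows. It is written in PyTorch purely for illustration (it is not the released Detectron implementation), it takes raw logits $x$ and labels $y\in\{\pm 1\}$ as in Equation 6, and it omits the $\alpha$-balancing factor and any training-time normalization; the defaults mirror the $\gamma=2.0$, $\beta=1.0$ setting of Table 3.

import torch
import torch.nn.functional as F

def focal_loss(x, y, gamma=2.0):
    """FL(p_t) = -(1 - p_t)**gamma * log(p_t), with x_t = y * x and p_t = sigmoid(x_t)."""
    xt = y * x
    pt = torch.sigmoid(xt)
    return (1 - pt) ** gamma * F.softplus(-xt)  # softplus(-x_t) == -log(sigmoid(x_t))

def focal_loss_star(x, y, gamma=2.0, beta=1.0):
    """Alternate form FL* = -log(p_t_star) / gamma, with p_t_star = sigmoid(gamma * x_t + beta)."""
    xt = y * x
    return F.softplus(-(gamma * xt + beta)) / gamma

# Per-anchor losses for random logits and labels in {+1, -1}.
x = torch.randn(8)
y = torch.randint(0, 2, (8,)).float() * 2 - 1
print(focal_loss(x, y))
print(focal_loss_star(x, y))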

Figure 6: Derivatives of the loss functions from Figure 5 w.r.t. $x$.
Figure 7: Effectiveness of $\textrm{FL}^{*}$ with various settings of $\gamma$ and $\beta$. The plots are color coded such that effective settings are shown in blue.

We found that various $\gamma$ and $\beta$ settings gave good results. In Figure 7 we show results for RetinaNet-50-600 with $\textrm{FL}^{*}$ for a wide set of parameters. The loss plots are color coded such that effective settings (models converged and with AP over 33.5) are shown in blue. We used $\alpha=.25$ in all experiments for simplicity. As can be seen, losses that reduce weights of well-classified examples ($x_{\textrm{t}}>0$) are effective.
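To see this qualitative pattern numerically, the self-contained sketch below evaluates $\textrm{FL}^{*}$ at a few values of $x_{\textrm{t}}$ for three $(\gamma, \beta)$ pairs; the pairs are chosen only for illustration and are not the exact grid behind Figure 7. Settings of the effective kind assign a sharply smaller loss once $x_{\textrm{t}}>0$.

import math

def fl_star(xt, gamma, beta):
    """FL* = -log(sigmoid(gamma * xt + beta)) / gamma, per Eq. (7)-(8)."""
    z = gamma * xt + beta
    return -math.log(1.0 / (1.0 + math.exp(-z))) / gamma

settings = [(1.0, 0.0), (2.0, 1.0), (4.0, 0.0)]  # illustrative (gamma, beta) pairs
for xt in (-2.0, 0.0, 2.0, 4.0):
    row = "  ".join(f"g={g} b={b}: {fl_star(xt, g, b):6.3f}" for g, b in settings)
    print(f"xt={xt:+.1f}  {row}")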

More generally, we expect any loss function with similar properties as FL or $\textrm{FL}^{*}$ to be equally effective.

Appendix B: Derivatives

For reference, derivatives for CE, FL, and $\textrm{FL}^{*}$ w.r.t. $x$ are:

$$\frac{d\textrm{CE}}{dx} = y(p_{\textrm{t}} - 1) \qquad (9)$$
$$\frac{d\textrm{FL}}{dx} = y(1 - p_{\textrm{t}})^{\gamma}(\gamma p_{\textrm{t}}\log(p_{\textrm{t}}) + p_{\textrm{t}} - 1) \qquad (10)$$
$$\frac{d\textrm{FL}^{*}}{dx} = y(p_{\textrm{t}}^{*} - 1) \qquad (11)$$

Plots for selected settings are shown in Figure 6. For all loss functions, the derivative tends to -1 or 0 for high-confidence predictions. However, unlike CE, for effective settings of both FL and $\textrm{FL}^{*}$, the derivative is small as soon as $x_{\textrm{t}}>0$.
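As a quick sanity check of Equation 11, the sketch below uses PyTorch autograd with assumed settings $\gamma=2$, $\beta=1$ (not part of the original experiments) and compares the automatically computed gradient of $\textrm{FL}^{*}$ with the analytic form $y(p_{\textrm{t}}^{*}-1)$:

import torch

gamma, beta = 2.0, 1.0                           # assumed FL* settings
x = torch.randn(6, requires_grad=True)           # raw logits
y = torch.randint(0, 2, (6,)).float() * 2 - 1    # labels in {+1, -1}

pt_star = torch.sigmoid(gamma * y * x + beta)
loss = (-torch.log(pt_star) / gamma).sum()       # sum so each x_i keeps its own gradient
loss.backward()

analytic = y * (pt_star.detach() - 1)            # right-hand side of Eq. (11)
print(torch.allclose(x.grad, analytic, atol=1e-5))  # expected: True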

References

  • [1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
  • [2] S. R. Bulo, G. Neuhold, and P. Kontschieder. Loss max-pooling for semantic image segmentation. In CVPR, 2017.
  • [3] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
  • [4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
  • [5] P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In BMVC, 2009.
  • [6] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.
  • [7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. IJCV, 2010.
  • [8] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Cascade object detection with deformable part models. In CVPR, 2010.
  • [9] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector. arXiv:1701.06659, 2016.
  • [10] R. Girshick. Fast R-CNN. In ICCV, 2015.
  • [11] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [12] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.
  • [13] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning. Springer series in statistics Springer, Berlin, 2008.
  • [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV. 2014.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [17] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017.
  • [18] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
  • [19] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989.
  • [20] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed. SSD: Single shot multibox detector. In ECCV, 2016.
  • [23] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [24] P. O. Pinheiro, R. Collobert, and P. Dollar. Learning to segment object candidates. In NIPS, 2015.
  • [25] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016.
  • [26] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  • [27] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In CVPR, 2017.
  • [28] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [29] H. Rowley, S. Baluja, and T. Kanade. Human face detection in visual scenes. Technical Report CMU-CS-95-158R, Carnegie Mellon University, 1995.
  • [30] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
  • [31] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
  • [32] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv:1612.06851, 2016.
  • [33] K.-K. Sung and T. Poggio. Learning and Example Selection for Object and Pattern Detection. In MIT A.I. Memo No. 1521, 1994.
  • [34] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI Conference on Artificial Intelligence, 2017.
  • [35] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
  • [36] R. Vaillant, C. Monrocq, and Y. LeCun. Original approach for the localisation of objects in images. IEE Proc. on Vision, Image, and Signal Processing, 1994.
  • [37] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.
  • [38] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
  • [39] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.