
KCPNet: Knowledge-Driven Context Perception Networks for Ship Detection in Infrared Imagery

Yaqi Han; Jingwen Liao; Tianshu Lu; Tian Pu; Zhenming Peng

Abstract:

Ship detection plays a crucial role in a variety of military and civilian marine inspection applications. Infrared images are irreplaceable data sources for ship detection due to their strong adaptability and excellent all-weather reconnaissance ability. However, previous research has mainly focused on visible light or synthetic aperture radar (SAR) ship detection, leaving infrared ship detection in a huge blind spot. The main obstacles behind this dilemma are the absence of public datasets, the small scale and poor semantic information of infrared ships, and the severe clutter in complex ocean environments. To address the above challenges, we propose a knowledge-driven context perception network (KCPNet) and construct a public dataset called the infrared ship detection dataset (ISDD). In KCPNet, aiming at the small scale of infrared ships, a balanced feature fusion network (BFF-Net) is proposed to balance information from all backbone layers and generate nonlocal features with balanced receptive fields. Moreover, considering the key role of contextual information, a contextual attention network (CA-Net) is designed to improve robustness in complex scenes by enhancing target and contextual information and suppressing clutter. Inspired by prior knowledge of human cognitive processes, we construct a novel knowledge-driven prediction head to autonomously learn visual features and back-propagate the knowledge throughout the whole network, which can efficiently reduce false alarms. Extensive experiments demonstrate that the proposed KCPNet achieves state-of-the-art performance on ISDD. Source codes and ISDD are accessible at https://github.com/yaqihan-9898.
Published in: IEEE Transactions on Geoscience and Remote Sensing (Volume: 61)
Article Sequence Number: 5000219
Date of Publication: 30 December 2022



SECTION I.

Introduction

In recent decades, increasing attention has been paid to marine resource development, space utilization, and environmental protection. As the dominant carrier of marine transportation, ships are the key targets in marine inspection. Ship detection in remote sensing images is of great strategic significance in civil and military fields such as marine traffic management, marine rescue, smuggling supervision, and national security. Unfortunately, in practical applications, the ocean environment can be highly complex and contains various sources of false alarms, including buildings on land, thick clouds, and severe weather such as strong winds, waves, and dense fog. Therefore, there is an urgent need to develop accurate and robust ship detection techniques for complex ocean environments.

As shown in Fig. 1, there are three main data sources for ship detection tasks: visible images, synthetic aperture radar (SAR) images, and infrared images. Visible images provide rich semantic information but cannot work at night or in extreme weather. SAR images, in contrast, can be acquired around the clock; thus, SAR imagery is an important complement for ship detection. Compared with visible and SAR images, infrared imaging detection in certain sensitive bands has better capabilities of detection, positioning, and identification, which are crucial in military applications. In addition, infrared images play an important role in all-weather and long-distance detection technology, especially in space-based detection applications. Therefore, infrared images are indispensable data sources for the ship detection task. However, current ship detection research [1], [2], [3], [4] is mainly based on visible and SAR images. Few studies focus on infrared images, let alone the pursuit of high detection performance. One reason for this research blind spot is the lack of public infrared datasets due to the confidentiality and inaccessibility of infrared images. Therefore, it is necessary to construct a new benchmark and study high-performance, robust infrared ship detection.

Fig. 1. Three main data sources for practical ship detection tasks: (a) visible images; (b) SAR images; and (c) infrared images.

In contrast to object detection in natural scenes, the infrared ship detection task has many challenges in the following respects.

  1. Low-resolution and single-channel information of infrared images.

  2. Complex scenes and weather. Grayscale values of the ocean surface and land present bipolarity [5] due to the temperature difference between day and night (shown in Fig. 2). The contrast between ships and the background is diverse because of the thin cloud and temperature change. Moreover, thick clouds and strip-shaped buildings on the land are prone to become false alarms.

  3. Inherent characteristics of ship targets. Ships in infrared remote sensing images are very small and lack semantic features. Ships parked near the coast are easily submerged by the land.

Fig. 2. Bipolarity of grayscale values: (a) during the day, the land is bright and the sea is dark; (b) at night, the land is dark and the sea is bright.

Current detection approaches can be preliminarily divided into two categories: conventional detection methods based on visual feature modeling and deep learning-based algorithms. The former are model-driven approaches, which rely on expert prior knowledge to design handcrafted features [6], [7]. Over the decades, model-driven approaches have achieved great breakthroughs in infrared small target detection [8], [9], [10], [11]. However, their robustness in complex scenes is unsatisfactory because the recognition accuracy is limited by visual feature representation and manual parameter tuning. The latter are data-driven algorithms based on convolutional neural networks (CNNs) [12], which offer high efficiency and stability. Many excellent networks, such as Faster RCNN [13], YOLO [14], SSD [15], and CenterNet [16], perform well in natural detection scenes. However, due to the black-box properties of CNNs, it is not easy to make targeted improvements for specific tasks. For infrared ship detection, previous networks show various limitations: 1) many false alarms cannot be identified with deep features alone and 2) after multiple pooling operations, the information of small ships is greatly lost or even submerged in high-layer feature maps.

To address the above challenges, we propose a knowledge-driven context perception network (KCPNet), which is an end-to-end two-stage network. The design of KCPNet is inspired by our previous work [34] on ship detection in optical images of the visible light band, further addressing the missed detection of small ships with weak semantics and the false alarm interference in low-quality infrared imagery. KCPNet combines a novel feature fusion network with receptive field expansion modules to improve the recall of small ships, proposes an efficient attention mechanism for complex scenes, and introduces well-designed visual features as prior knowledge to drive the prediction head for false alarm removal. KCPNet achieves state-of-the-art performance on the first public infrared ship detection benchmark, the infrared ship detection dataset (ISDD), published in this article. The main contributions of this study are as follows.

  1. Aiming at the small scale of ships, a balanced feature fusion network (BFF-Net) is proposed to ensure the efficient information transmission of small targets and expand receptive fields by constructing balanced local and nonlocal features.

  2. Considering the thin semantic information of infrared targets and false alarms in complex scenes, we design a pixel-level contextual attention network (CA-Net) to strengthen target and contextual information simultaneously and suppress severe clutter in complex scenes.

  3. To further reduce false alarms, a novel knowledge-driven prediction head integrates the well-designed visual features as prior knowledge through a supervised regression branch. The prior knowledge can be back-propagated to the entire network, and the learned features with strong physical interpretability can guide the final prediction jointly.

  4. To advance the research of ship detection in infrared images, we construct a new dataset called ISDD, which contains 1284 infrared images with 3061 ship instances. As far as we know, ISDD is the first public benchmark for infrared ship detection.

The rest of this article is organized as follows. Section II briefly introduces the model-driven ship detection schemes and the deep learning-based ship detection networks. A detailed description of the proposed dataset ISDD is given in Section III. Section IV contains details of the proposed framework design. Section V analyzes the experimental results based on the ablation study and comparison with the state-of-the-art approaches. Section VI presents the conclusions.

SECTION II.

Related Work

A. Model-Driven Schemes

Model-driven schemes can be further divided into two categories: feature modeling-based methods and visual saliency-based methods. Both are required to segment the marine area to eliminate the false alarms in land area and reduce the computational complexity.

Feature modeling-based methods follow the detection pattern of “candidate region proposal + physical feature extraction + classifier.” How to select or design the visual feature is decisive to the detection performance. The commonly used visual features include geometric features, edge features, contrast features, and texture features. Xia et al. [17] introduced a dynamic model to fuse the geometric features of images and employed support vector machine (SVM) to detect ships. Krizhevsky et al. [18] took means, variances, and wavelet changes as feature descriptors. Aiming at the high aspect ratio (AR) rectangular shape of ships, Qi et al. [19] designed a novel descriptor called ship histogram of oriented gradient (S-HOG) to characterize the gradient symmetry of ship sides.

Visual saliency-based methods simulate the human brain to focus on the salient regions quickly and promptly locate and perceive the targets in complex scenarios. There are two categories of visual saliency-based models: spatial saliency models and frequency saliency models. Itti et al. [20] proposed a ground-breaking spatial saliency model, which constructs the final saliency map based on intensity, color, and orientation. Harel et al. [21] defined a Markov chain-based saliency model and treated the balanced distribution over map locations as saliency values. Frequency saliency is defined in the Fourier spectrum or Gabor spectrum. Guo et al. [22] adopted the spectral residual [23] to obtain the initial target curves. Based on the entropy information, Xu et al. [24] modeled a combined saliency structure with self-adaptive weights.

Each handcrafted feature in model-driven schemes has strong physical meaning and rich local or global semantic information. Unfortunately, the detection performance relies on parameter tuning and lacks robustness in complex marine scenarios. Thus, model-driven ship detection methods reach satisfactory performance only in high-quality images with simple scenes.

B. Deep Learning-Based Schemes

In recent years, the rapid growth of computing power has prompted the rapid development of data-driven algorithms. As a typical representative, deep learning-based approaches have achieved excellent performance in the computer vision field. Depending on whether candidate regions are proposed, object detection networks can be classified into two categories: single-stage networks and two-stage networks. Single-stage networks represented by YOLO [14], SSD [15], and RetinaNet [25] have low computational complexity and high speed due to the omission of the region proposal step. Two-stage networks such as Faster RCNN [13] tend to have higher accuracy and are less troubled by the imbalance between positive and negative samples.

For the challenges in the ship detection task, many researchers have made effective improvements based on the above networks. For higher accuracy and robustness on large-scale infrared images, Zhou et al. [26] proposed a simple one-stage ship detection network to learn joint features from multiresolution infrared images. As one of the few high-quality infrared ship detection studies, Wang et al. [5] employed spectrum characteristics for coarse ship detection and designed a light CNN for fine detection, thus achieving competitive detection results under limited computation and small storage space. Many ship detection networks based on visible light or SAR images are also inspiring. Multilayer feature fusion structures have attracted the attention of many researchers for improving the recall of small ships. For example, Jiao et al. [27] proposed a dense feature fusion network, finding that the retention of low-level features can greatly increase the recall of small ships. Li et al. [28] introduced an attention mechanism into the feature fusion process, further enhancing the efficiency of information passing. In addition, attention mechanisms have also been adopted to reduce false alarms in complex scenes. Fu et al. [4] proposed an anchor-free network with an attention-guided balanced pyramid (ABP) to extract salient features in complex scenes adaptively and efficiently. Cui et al. [29] introduced a spatial shuffle-group enhance (SSE) attention module into CenterNet to achieve higher accuracy for ship detection in large-scale SAR images. Furthermore, some studies adopted contextual information to enhance the networks' ability to identify false alarms. For example, Kang et al. [30] concatenated region of interest (RoI) features and contextual features around RoIs in the prediction head to reduce false alarms. Li et al. [31] improved YOLOv5 with dilated convolution modules to enlarge receptive fields for better robustness in complex scenes.

Most of the above ship detection research is conducted on visible or SAR images, while research on infrared images is quite rare. The main obstacle is the lack of datasets. Moreover, due to the poor physical interpretability of neural networks, it is difficult to make targeted improvements for particular problems. Thus, current detection networks do not adequately address the specific challenges of ship detection in infrared images. Therefore, a new research idea emerges: introduce the knowledge of model-driven approaches into data-driven networks to combine their strengths and complement their weaknesses. There have been some novel attempts to combine data-driven and model-driven approaches in other fields. For example, Wang et al. [32] detected hard exudates based on multifeature joint representation for diabetic retinopathy screening. Arvor et al. [33] proposed a knowledge-driven scheme for land cover mapping over large areas. Wang et al. [34] designed a model-data-knowledge-driven and deep learning (MDK-DL) method for land surface temperature retrieval. However, these approaches employ model-driven and data-driven modules in separate stages of the entire framework and are thus not end-to-end. This article explores efficient solutions for combining model-driven approaches and data-driven networks in an end-to-end way for the infrared ship detection task.

SECTION III.

Infrared Ship Detection Dataset

A. Motivation

Datasets play a critical role in data-driven models. However, as far as we know, there is no public dataset for ship detection in infrared images. Previous infrared ship detection studies adopt two main strategies to deal with the absence of public datasets: using small-scale private datasets or simulating infrared images from visible remote sensing images. Most of the private datasets contain no more than 500 images. These private datasets are difficult to obtain, and their scales are not suitable for the study in this article. As for the simulated datasets, there are three main approaches to converting visible images into infrared images: generative adversarial networks (GANs) [35], variational autoencoders (VAEs) [36], and traditional image processing methods. No matter which simulation approach is utilized, the simulated images cannot imitate the various environmental factors of infrared imaging. To sum up, the dataset strategies adopted in previous studies cannot meet the needs of high-performance infrared ship detection research. Therefore, we develop ISDD as a benchmark for the research in this article. More importantly, we hope ISDD will be helpful to other researchers in related fields.

B. Collection and Annotations of ISDD

As mentioned above, synthetic infrared remote sensing images are far inferior to real collected ones. Our ISDD collects real infrared remote sensing images taken by the Landsat 8 satellite, which carries the Operational Land Imager (OLI) with nine imaging bands and the Thermal Infrared Sensor (TIRS) with two thermal infrared imaging bands. We fuse the three OLI bands Band 7, Band 5, and Band 4 to obtain short-wave infrared images. All images in ISDD are preprocessed by radiometric calibration and FLAASH atmospheric correction. In order to expand the diversity of the dataset, ISDD collects images taken in the United States, China, Japan, Australia, Europe, etc.

ISDD contains a total of 1284 infrared images with 3061 ship instances. The image size is 500 × 500 pixels, and the resolution is 10–20 m. We split the dataset into a training set, a validation set, and a test set according to the ratio of 6:1:3. The detailed statistics of the dataset division are listed in Table I. ISDD collects 373 inshore scenes with 924 instances and 911 offshore scenes with 2137 instances. Fig. 3 shows the distribution of the instance number per image for inshore and offshore scenarios. An image has an average of ten ships and a maximum of 27 ships. For ship annotation, many visible image datasets use oriented bounding boxes to annotate ships to prevent box overlap when ships are densely distributed. In practice, only tiny ships, such as yachts, are often densely packed. However, limited by the resolution of infrared images, tiny ships are largely invisible in infrared images. Therefore, we annotate the ship instances with horizontal bounding boxes, which are sufficient for dispersed medium and large ships. As an important reference indicator for anchor-based models, the AR of the bounding boxes in ISDD ranges from 1.0 to 5.14, with an average AR of 1.84, as shown in Fig. 4(a).

TABLE I Dataset Division of ISDD
Fig. 3. Distribution of instance number per image for inshore and offshore scenarios.

Fig. 4. Statistics of instances in ISDD: (a) AR; (b) instance area; and (c) instance size distribution comparison between ISDD, the optical image dataset DOSR [34], and the SAR image dataset SSDD [35]. To be fair, the short side of images in all datasets is resized to 500 pixels.

C. Properties of ISDD

The proposed ISDD has the following properties.

  1. Small Scale of Ship Instances: As shown in Fig. 4(b), the proportion of instance area to the whole image area ranges from 0.014% to 2.39%, with an average proportion of only 0.18%. Taking the height of the bounding box as a reflection of ship size, Fig. 4(c) shows the instance size distribution comparison between three different ship detection datasets: ISDD, an optical ship recognition dataset called the dataset for oriented ship recognition (DOSR) [37], and the widely used SAR ship detection dataset (SSDD) [38]. The height of ships in ISDD ranges from 4 to 73 pixels. Instances in ISDD have an average height of 19.59 pixels, which is much smaller than 71.22 of DOSR and 58.94 of SSDD. The above statistics indicate that the scale of the ships in ISDD is quite small, which brings great challenges to detection networks.

  2. Scene Diversity: The various scenarios in ISDD are complex and close to practical applications. According to the state of the ship stern, there are two types of scenes: ships with trail and ships without trail. As shown in Fig. 5(a), most ship trailing waves are manifested as bright stripes of water streaks behind the ships, and a few trailing waves appear as sectorial water streaks behind the ships. For ships with trail, ISDD only annotates the hull part, not the trail part. In addition, ISDD contains not only a large number of offshore scenes but also a wealth of nearshore scenes with more complex backgrounds. ISDD contains 373 inshore scenes with 924 instances and 911 offshore scenes with 2137 instances. As shown in Fig. 5(b), inshore scenes contain many buildings on and near land, which may greatly increase the false alarm rate. There are also moored ships, shown in Fig. 5(c), which are easily submerged by land areas with similar grayscale, leading to missed detections. Moreover, nearshore scenes also exhibit the grayscale inversion phenomenon caused by diurnal temperature fluctuation, shown in Fig. 2.

  3. Weather Condition Diversity: Images in ISDD are captured in a variety of weather conditions. A typical scenario is the windy weather shown in Fig. 5(d), where irregular ripples appear on the sea surface. Fig. 5(e) and (f) shows the scenes of thin and thick clouds, respectively. In weather with thin clouds, the contrast between ships and the background is sharply reduced, while dotted thick clouds are severe clutter and are prone to be detected as false alarms.

  4. Publicly Available: As the first public dataset for ship detection in infrared remote sensing images, ISDD is publicly available on GitHub: https://github.com/yaqihan-9898/ISDD.

Fig. 5. Scene diversity and weather condition diversity in ISDD: (a) ships with trail; (b) inshore scenes; (c) berthing scene; (d) sea wave scenes; (e) thin cloud scenes; and (f) thick cloud scenes.

SECTION IV.

Methodology

The overall framework of the proposed KCPNet is illustrated in Fig. 6. KCPNet is an end-to-end network with a two-stage structure. In the first stage, balanced feature maps with densely structured receptive fields are extracted by BFF-Net. CA-Net further suppresses clutter in complex scenes and highlights target and contextual information. The enhanced features are sent to the RPN to generate proposals. In the second stage, visual features serve as prior knowledge to guide visual awareness learning. Deep features and learned visual features jointly contribute to the final prediction.

Fig. 6. Overall framework of KCPNet. There are three subnetworks in KCPNet: 1) BFF-Net efficiently fuses features from all backbone layers and introduces nonlocal information with balanced receptive fields; 2) CA-Net simultaneously enhances targets and corresponding contextual information and suppresses clutter in complex scenes; and 3) the knowledge-driven prediction head learns visual features to conduct the final prediction task jointly with deep features and back-propagates knowledge to the whole network.

A. Balanced Feature Fusion Network

The network neck, represented by FPN [39], undertakes the task of fusing and postprocessing the backbone features. We design a BFF-Net for infrared ship detection. The proposed BFF-Net balances features in three aspects. First, it balances semantic information and resolution in a single feature map layer for small targets. Second, many feature fusion networks [36], [39], [40], [41] add or concatenate different layers equally; in fact, different feature maps contribute differently to the whole detection task. BFF-Net balances the information ratio of different layers and different channels to enhance efficient information and reduce information redundancy. Third, a receptive field expansion module (RFEM) in BFF-Net balances the structure of different receptive fields. The structure of BFF-Net is illustrated in Fig. 7.

Fig. 7. Structure of BFF-Net. FFM is the first component of BFF-Net, which fuses backbone features to an appropriate resolution for small targets with a balanced information structure of features from different layers. The fused features are input to the RFEM to capture nonlocal contextual information.

There are two core components in BFF-Net: a feature fusion module (FFM) and an RFEM. We utilize Block1–Block4 of ResNet as our backbone. To speed up convergence and avoid overfitting, we employ Block5 of ResNet and global average pooling (GAP) to substitute for the commonly used two FC layers in the prediction head. Thus, our backbone only generates four feature map layers C1–C4. The feature fusion process is applied to C2, C3, and C4. We use bilinear interpolation operations and 1 × 1 convolutions to adjust C2 and C4 to the same shape as C3. We utilize an additional channel attention module for C2 and C3 to help the network automatically adjust the information ratio of different layers and different channels and adaptively generate information-balanced features. Channel attention is introduced to adjust the weights of both intralayer and interlayer features. Note that C4 is not followed by a channel attention module. This is because C4, which contains the most semantic features, is selected as the benchmark in the interlayer weight adjustment task. If C4 were assigned channel attention, there would no longer be a benchmark feature map. Without a benchmark, the three attention modules, which are trained simultaneously, are too coupled for the weight adjustment task, resulting in slow convergence and low detection accuracy. The fusion process of FFM can be formulated as follows:

F = Conv_1×1[UpSample(C4)] + CA(C3) + CA{Conv_1×1[DownSample(C2)]}    (1)

where CA denotes the channel attention module of the convolutional block attention module (CBAM) [42], UpSample is 2× bilinear interpolation up-sampling, DownSample is 2× bilinear interpolation down-sampling, Conv_1×1 represents a 512-D 1 × 1 convolution layer, and F is the fused feature map.
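For illustration, the fusion in (1) can be sketched as follows in PyTorch (the authors implement KCPNet in TensorFlow, so this is only a re-implementation sketch; the channel sizes assumed for C2–C4 and the extra 1 × 1 convolution used here to align the channels of C3 are our assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """CBAM-style channel attention: a shared MLP over avg- and max-pooled features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):
        w = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)) +
                          self.mlp(F.adaptive_max_pool2d(x, 1)))
        return x * w


class FFM(nn.Module):
    """Fuse C2-C4 to the resolution of C3, following Eq. (1)."""
    def __init__(self, c2_ch=512, c3_ch=1024, c4_ch=2048, out_ch=512):
        super().__init__()
        self.conv_c4 = nn.Conv2d(c4_ch, out_ch, 1)   # 1x1 conv after up-sampling C4
        self.conv_c2 = nn.Conv2d(c2_ch, out_ch, 1)   # 1x1 conv after down-sampling C2
        self.conv_c3 = nn.Conv2d(c3_ch, out_ch, 1)   # channel alignment for C3 (assumption)
        self.ca_c2 = ChannelAttention(out_ch)
        self.ca_c3 = ChannelAttention(out_ch)

    def forward(self, c2, c3, c4):
        h, w = c3.shape[-2:]
        up_c4 = self.conv_c4(F.interpolate(c4, size=(h, w), mode="bilinear", align_corners=False))
        down_c2 = self.ca_c2(self.conv_c2(F.interpolate(c2, size=(h, w), mode="bilinear", align_corners=False)))
        return up_c4 + self.ca_c3(self.conv_c3(c3)) + down_c2   # fused feature map F
```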

Due to the lack of local semantics of small targets, there is an urgent need to extract nonlocal contextual information with large receptive fields as an important supplement and compensation. In order to capture balanced nonlocal contextual features and local target features, we propose an RFEM based on dense-connected dilated convolution. The structure of RFEM is depicted in Fig. 8 and can be formulated as follows:

R1 = DConv[1]_3×3(F)    (2)
R2 = DConv[2]_3×3(Conv_1×1(R1 © F))    (3)
R3 = DConv[3]_3×3(Conv_1×1(R1 © R2 © F))    (4)
P = R1 © R2 © R3 © F,  if C_out = 2C;    P = Conv_1×1(R1 © R2 © R3 © F),  if C_out ≠ 2C    (5)

where F is the fused feature map from FFM, and Ri (i = 1, 2, 3) are intermediate variables. DConv[1]_3×3 denotes a 3 × 3 × C × (C/2) dilated convolution with a dilation rate of 2, and DConv[2]_3×3 and DConv[3]_3×3 are 3 × 3 × (C/4) × (C/4) dilated convolutions with dilation rates of 4 and 8, respectively. Conv_1×1 is the standard 1 × 1 convolution, © represents the channel concatenation operation, and P is the output feature map. An optional standard 1 × 1 convolution at the end of RFEM adjusts the channel number of the output feature map, where C_out is the output channel number of RFEM. If the output channel number is 2C, this 1 × 1 convolution is not needed.
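A minimal PyTorch sketch of (2)–(5), again for illustration only (the padding values are assumptions chosen so that every branch keeps the spatial size of F; the authors' implementation is in TensorFlow):

```python
import torch
import torch.nn as nn


class RFEM(nn.Module):
    """Dense-connected dilated convolutions following Eqs. (2)-(5)."""
    def __init__(self, c, c_out=None):
        super().__init__()
        # DConv[1]: 3x3, dilation 2, C -> C/2
        self.dconv1 = nn.Conv2d(c, c // 2, 3, padding=2, dilation=2)
        # DConv[2]: 3x3, dilation 4, C/4 -> C/4, fed by a 1x1 reduction of [R1, F]
        self.reduce2 = nn.Conv2d(c // 2 + c, c // 4, 1)
        self.dconv2 = nn.Conv2d(c // 4, c // 4, 3, padding=4, dilation=4)
        # DConv[3]: 3x3, dilation 8, C/4 -> C/4, fed by a 1x1 reduction of [R1, R2, F]
        self.reduce3 = nn.Conv2d(c // 2 + c // 4 + c, c // 4, 1)
        self.dconv3 = nn.Conv2d(c // 4, c // 4, 3, padding=8, dilation=8)
        # optional 1x1 conv when the desired output channel number differs from 2C
        self.proj = None if c_out in (None, 2 * c) else nn.Conv2d(2 * c, c_out, 1)

    def forward(self, f):
        r1 = self.dconv1(f)
        r2 = self.dconv2(self.reduce2(torch.cat([r1, f], dim=1)))
        r3 = self.dconv3(self.reduce3(torch.cat([r1, r2, f], dim=1)))
        p = torch.cat([r1, r2, r3, f], dim=1)          # C/2 + C/4 + C/4 + C = 2C channels
        return p if self.proj is None else self.proj(p)
```

Note how the concatenation in the last step keeps the original feature map F as the largest single share of the output channels, which is the "dominance of the initial information" discussed below.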

Fig. 8. Structure of RFEM. H, W, and C are the height, width, and channel number of the input feature map, respectively. RFEM maintains the clear dominance of the channel number of the input feature map at each fusion to prevent the initial information from being overwhelmed.

The stride of F and P is 8. When the input image is 500 × 500, the resolution of F is 62 × 62. Regarding the receptive field of F as 1, the receptive fields of R1, R2, R3, and P are expanded to 5, 13, 29, and 29, respectively. The receptive field of P covers nearly half the size of F, which contains adequate nonlocal background information. Compared with other popular dilated convolution-based modules, the proposed RFEM has three innovative characteristics. First, some dilated convolution-based modules, such as atrous spatial pyramid pooling (ASPP) [43], do not deal with the gridding problem of dilated convolution, which can lead to the loss of local information. The dense-connected pyramid structure of RFEM superimposes and reuses the original receptive field and different expanded receptive fields to alleviate the gridding problem. Second, it is worth noting that the design of the number of channels in RFEM is very important. Many dilated convolution-based RFEMs [43], [44], [45] ignore the dominance and hierarchical relationship of feature maps with different receptive fields, thus resulting in the flooding of the initial features. On the contrary, RFEM maintains the clear dominance of the initial information. Third, when the dilation rate increases, the number of effective filtering parameters gradually decreases [43]. In the extreme case, when the dilation rate equals the size of the feature map, the convolution degenerates into a 1 × 1 convolution. Considering this problem, RFEM expands the receptive field to 29 with a maximum dilation rate of only 8. In contrast, most other modules [43], [44], [45], [46] adopt only one of the parallel and cascade structures. For a parallel structure, in order to achieve the same expansion size, the required dilation rate is as high as 14, so the effective convolution parameters are greatly reduced compared with the proposed RFEM. For a cascade structure, the information from the original feature maps is severely diluted in the later convolutions. The proposed RFEM combines the parallel and cascade structures to solve the above problems.
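As a quick check of these receptive-field values (our own arithmetic, assuming each 3 × 3 dilated convolution with dilation rate d adds 2d to the receptive field measured on the stride-8 grid):

```latex
\begin{aligned}
RF(R_1) &= 1 + 2\cdot 2 = 5,\\
RF(R_2) &= RF(R_1) + 2\cdot 4 = 13,\\
RF(R_3) &= RF(R_2) + 2\cdot 8 = 29,\\
RF(P)   &= \max\{RF(R_1),\,RF(R_2),\,RF(R_3),\,1\} = 29.
\end{aligned}
```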

B. Contextual Attention Network

Ocean scenes can be very complex in practical applications, especially near the shore. Background clutter and false alarms seriously interfere with detection accuracy. To solve this problem, attention mechanisms have been proposed. Current research is mainly based on unsupervised attention networks, such as [42], [47], [48], [49], and [50]. However, unsupervised attention networks cannot learn for specific targets. In addition, false alarms closely resemble the appearance of ships, so the features of false alarms are prone to being enhanced together with real targets. Fortunately, supervised attention networks utilize the ground truth to guide the learning process; thus, they are more advantageous for removing false alarms.

Based on the supervised dual-mask attention network proposed in [37], we design a new CA-Net to break the limitation of the weak semantic features of small ships and reduce false alarms in infrared images. The design of CA-Net is based on an important observation: as shown in Fig. 9, in a set of detection result patches mixed with targets and false alarms, even humans cannot quickly identify which are targets. Once we include the surrounding areas of the detected results in the scope of observation (shown in Fig. 10), humans can easily distinguish between targets and false alarms. This observation also holds true for the network learning process, where the network requires not only the target information but also the surrounding areas of the target. Therefore, CA-Net is proposed to strengthen target and contextual information simultaneously and suppress the background clutter.

Fig. 9. Detection result patches mixed with ship targets and false alarms. Even humans cannot tell which are ship targets in a second. Actually, (a)–(d) are ship targets and (e)–(h) are all false alarms.

Fig. 10. Detection result patches with surrounding areas. Compared with the patches in Fig. 9, patches with surrounding areas are much easier to identify: (a)–(d) ship targets and (e)–(h) false alarms.

CA-Net is a supervised pixel-level attention network. The framework of CA-Net is illustrated in Fig. 11. The labels of CA-Net are two binarized masks, Mask1 and Mask2, with the same resolution as the input feature map. In the two masks, the foreground pixels are set to 1, and the background is filled with 0. For Mask1, the areas of the ground-truth bounding boxes are foreground. For Mask2, we regard areas of β times the scale of the ground truth as foreground. Under the guidance of Mask1 and Mask2, CA-Net utilizes five 3 × 3 convolution layers with 256 channels followed by one 3 × 3 convolution layer with two channels to construct the learned masks M1 and M2. The loss function of CA-Net is described in Section IV-D. Then, softmax is employed to rescale the pixel values of M1 and M2 to [0, 1]. M1 and M2 are weighted and added to obtain the final pixel-level attention mask M. Mask M is multiplied with P pointwise to obtain the enhanced feature map. Moreover, we add a shortcut to prevent the network degradation caused by a large number of background pixels being suppressed to 0. The above process can be formulated as follows:

M1 = Conv[1]_3×3×2(Conv^5_3×3×256(P))    (6)
M2 = Conv[2]_3×3×2(Conv^5_3×3×256(P))    (7)
P′ = P ⊙ [1 + α1·softmax(M1) + α2·softmax(M2)]    (8)

where P denotes the output feature map from BFF-Net. Conv^5_3×3×256 represents five 3 × 3 × 256 convolution layers, which are shared by M1 and M2. Conv[1]_3×3×2 and Conv[2]_3×3×2 denote different 3 × 3 × 2 convolution layers for the calculation of M1 and M2, respectively. ⊙ denotes pointwise multiplication. α1 and α2 are tradeoff hyperparameters. P′ is the enhanced feature map output by CA-Net.
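A minimal sketch of (6)–(8) in PyTorch, for illustration only. Taking the foreground channel of the 2-channel softmax output as the spatial mask is our assumption; the shared stack of five 3 × 3 × 256 layers and the two 3 × 3 × 2 heads follow the description above:

```python
import torch
import torch.nn as nn


class CANet(nn.Module):
    def __init__(self, in_ch, alpha1=0.5, alpha2=0.5):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(5):                                  # shared 3x3x256 stack
            layers += [nn.Conv2d(ch, 256, 3, padding=1), nn.ReLU(inplace=True)]
            ch = 256
        self.shared = nn.Sequential(*layers)
        self.head1 = nn.Conv2d(256, 2, 3, padding=1)        # predicts M1 (target mask)
        self.head2 = nn.Conv2d(256, 2, 3, padding=1)        # predicts M2 (context mask)
        self.alpha1, self.alpha2 = alpha1, alpha2

    def forward(self, p):
        feat = self.shared(p)
        m1, m2 = self.head1(feat), self.head2(feat)          # logits, supervised by Mask1/Mask2
        a1 = torch.softmax(m1, dim=1)[:, 1:2]                # foreground probability (assumed channel 1)
        a2 = torch.softmax(m2, dim=1)[:, 1:2]
        p_enh = p * (1 + self.alpha1 * a1 + self.alpha2 * a2)  # Eq. (8), with the "+1" shortcut
        return p_enh, m1, m2
```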

Fig. 11. Framework of CA-Net. Mask1 and Mask2 guide CA-Net to generate the spatial attention mask M, which is applied to the input P to obtain the enhanced feature map P′. Comparing P and P′, target and context information is obviously strengthened in P′, and the severe clutter in the background is suppressed effectively.

CA-Net no longer restricts its attention to the target itself but extends it to the contextual characteristics of the target's surrounding environment. As shown in Fig. 11, compared with the original input feature map P, the information of targets and their surrounding areas is highlighted in P′, and clutter and false alarms in the other areas are suppressed efficiently. Small ships risk being submerged by the background after multiple pooling operations. Fortunately, the strengthened contextual features expand the high-response area of a small target, which is beneficial to reducing the missed detection of small targets. In addition, supplementing contextual features can alleviate the problem of low classification performance caused by the lack of semantic information of infrared targets.

C. Knowledge-Driven Prediction Head

1) Structure of Prediction Head:

Due to the black-box nature of CNNs, the physical interpretability of deep features is poor. Experiments demonstrate that many false alarms cannot be discerned if only deep features are used. Considering that the learning process of a CNN simulates the human cognitive process, to solve the false alarm problem at its root, we should analyze why humans can detect ships quickly in an infrared image. One of the key advantages of human beings is that we have sufficient prior knowledge of ships, such as their shape and contrast and where they might appear. These features with clear physical meaning can be modeled by the visual features of model-driven methods. Therefore, we combine the model-driven approach with the data-driven approach and design a novel detection head as depicted in Fig. 12.

Fig. 12. Workflow of the knowledge-driven prediction head. Inside the left green box, the solid and dashed lines represent the RoI align and crop operations, respectively. The colors of the lines above the RoIs correspond to the line colors in the left green box, reflecting the proposal source and the operation used to obtain the RoIs. Visual features generated from RoI-OIs serve as training labels, and RoI-FMs are input to five 3 × 3 convolutions followed by a GAP to predict the corresponding visual features, which are concatenated with deep features for the final prediction. The dashed-line part only participates in the training stage, and the solid-line part is involved in both training and testing stages.

As we analyzed in Section IV-B, contextual information is beneficial for distinguishing ship targets from nonship targets. Therefore, in the knowledge-driven prediction head, RoI align is conducted twice to get the basic RoI from the feature map (Basic RoI-FM) and the context RoI from the feature map (Context RoI-FM). The Basic RoI-FM is the same as the classic RoI in traditional two-stage networks. Note that the Context RoI-FM is generated from context proposals, which have β (β > 1) times the length and width of the basic proposal. The setting of β corresponds with Section IV-B. Meanwhile, based on the basic proposals and context proposals, we crop the basic RoI from the original image (Basic RoI-OI) and the context RoI from the original image (Context RoI-OI), respectively.

Visual features computed from Basic RoI-OIs and Context RoI-OIs serve as training labels. Basic RoI-FMs and Context RoI-FMs are input to five 3 × 3 convolution layers followed by a GAP to generate the learned visual features. The visual features utilized in the prediction head are introduced in detail in Section IV-D. In the meantime, Basic RoI-FMs are input to Block5 of ResNet followed by a GAP to obtain deep features. Deep features and visual features are concatenated to get the joint features, which are used for the final class prediction and bounding box regression.

Intuitively, we could directly combine the visual features computed on RoI-OIs with the deep features. However, experiments demonstrate that introducing some visual features with high computational complexity greatly slows down the network. The supervised learning mechanism used in our prediction head can uniformly calculate different visual features at low computational cost while maintaining the validity of the visual features. More importantly, the prior knowledge can be passed to all previous feature extraction processes through the back-propagation mechanism. As the bridge between the model-driven and data-driven approaches, the proposed prediction head exploits their respective strengths and diminishes their respective disadvantages. The loss function of the knowledge-driven prediction head is further introduced in Section IV-D.
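A minimal PyTorch sketch of this prediction head, for illustration only. The feature dimensions (2048-D deep features, 84-D visual features), the linear layer mapping the GAP output to the visual-feature vector, and the simple summation used to merge the basic and context RoI streams are all our assumptions rather than details given in the paper:

```python
import torch
import torch.nn as nn


class KnowledgeHead(nn.Module):
    """Deep features from Block5 and learned visual features are concatenated for prediction."""
    def __init__(self, block5, roi_ch=512, deep_dim=2048, vis_dim=84, num_classes=2):
        super().__init__()
        self.block5 = block5                                # ResNet Block5 producing deep_dim channels
        convs, ch = [], roi_ch
        for _ in range(5):                                  # five 3x3 convolutions
            convs += [nn.Conv2d(ch, 256, 3, padding=1), nn.ReLU(inplace=True)]
            ch = 256
        self.vis_convs = nn.Sequential(*convs)
        self.vis_fc = nn.Linear(256, vis_dim)               # maps GAP output to 84-D (assumption)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.cls = nn.Linear(deep_dim + vis_dim, num_classes)
        self.reg = nn.Linear(deep_dim + vis_dim, 4)

    def forward(self, basic_roi_fm, context_roi_fm):
        deep = self.gap(self.block5(basic_roi_fm)).flatten(1)                  # data-driven deep features
        vis_basic = self.vis_fc(self.gap(self.vis_convs(basic_roi_fm)).flatten(1))
        vis_context = self.vis_fc(self.gap(self.vis_convs(context_roi_fm)).flatten(1))
        vis = vis_basic + vis_context        # merging the two RoI streams this way is a simplification
        joint = torch.cat([deep, vis], dim=1)
        return self.cls(joint), self.reg(joint), vis        # vis is regressed against handcrafted labels
```

During training, the third output would be supervised with the visual features computed from the RoI-OIs, so the prior knowledge flows back through the whole network, as described above.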

2) Design of Visual Features:

The visual features should contain sufficient prior knowledge and be capable of distinguishing ship targets from false alarms. In order to better design the visual features, we need to analyze the characteristics of false alarms. As displayed in Fig. 13, there are two main categories of false alarms: cloud false alarms and land false alarms. Cloud false alarms can be further divided into strip clouds and point clouds. Land false alarms contain strip highlighted buildings, protruding land, and small islands. Most false alarms are similar in shape to ships or have a contrast with the surrounding background similar to that of ships. Fortunately, there are also many differences in detail between false alarms and ship targets, such as texture and the regularity of shape.

Fig. 13. Detection results with false alarms. Green bounding boxes are ships that are correctly detected. Red bounding boxes are false alarms. For a clearer view of the false alarms, all false alarms are numbered and zoomed in below the image: (a) strip cloud false alarms; (b) point cloud false alarms; (c) strip highlight buildings; (d) No. 1 false alarm is highlight buildings near the port, and No. 2 false alarm is protruding land; and (e) island false alarms.

Based on the analyses above, we select three types of visual features for the network to learn: geometric features, texture features, and contrast features. The computation of all visual features is based on the RoI-OIs. The detailed descriptions are as follows.

  1. The geometric features include rectangle degree, AR, and the number of connected areas. All RoI-OIs should be binarized first before calculating the geometric features. See Table II for formulaic definitions of three geometric features. Geometric features perform best in removing false alarms with irregular or unclear boundaries such as clouds and islands.

  2. The texture features contain HOG features [51] and GLCM-based features [52]. For HOG features, as illustrated in Fig. 14, we first compute nine-bin HOG features of the Basic RoI-OI and the Context RoI-OI, then divide the Context RoI-OI into four corner blocks and calculate the HOG features of each block separately. The GLCM-based features include six descriptors: dissimilarity, homogeneity, contrast, energy, correlation, and angular second moment (ASM). Table III lists the formulas of the six GLCM-based features. All GLCM-based features are calculated in four directions, that is, 0, π/4, π/2, and 3π/4. Texture features work well against false alarms with uneven grayscale distribution.

  3. The contrast feature is designed to describe the contrast between the RoI and the surrounding background. As illustrated in Fig. 14, we define the surrounding background B as the ring area obtained by cutting the Basic RoI-OI out of the Context RoI-OI. Due to the effects of atmospheric refraction and optical defocusing, the bodies of infrared ships generally exhibit a Gauss-like morphological feature. However, because of the trail parts of ships, traditional contrast algorithms using eight-neighbor grayscale differences or ratios, such as the local contrast measure (LCM) [53] and the multiscale patch-based contrast measure (MPCM) [54], are no longer suitable for reflecting ship contrast. As shown in Fig. 15(a), MPCM can identify cloud and port false alarms, but the contrast of trailing ships is incorrectly low, and island false alarms cannot be distinguished correctly. In practice, strip buildings near the port, as shown in Fig. 13(d), are easily confused with trailing ships and detected as false alarms. It is worth noting that the grayscale of the ship's trailing area usually decreases unevenly, while the grayscale of port buildings is often evenly distributed along a specific direction. Considering the grayscale distribution characteristic and the Gauss-like morphological feature of the ship body, the regional intensity level (RIL) [55] is a good measure of grayscale complexity. We propose a local contrast algorithm for infrared ship targets based on RIL, which can effectively distinguish ships from false alarms (a sketch of this computation is given after this list). The formulaic expression of our contrast feature is as follows:

     K = k·(w·h)/a,    M_T = (1/K)·Σ_{i=1}^{K} I_T(i),    m_T = (1/N)·Σ_{i=1}^{N} I_T(i),
     RIL_T = M_T / m_T,    M_B = (1/K)·Σ_{i=1}^{K} I_B(i),    m_B = (1/M)·Σ_{i=1}^{M} I_B(i),
     RIL_B = M_B / m_B,    W = RIL_T² / (RIL_B + ε)    (9)

    where T denotes the target RoI-OI area, and B denotes the background area. k is a hyperparameter and is usually set to 5–20. a is another hyperparameter and is usually set to the average area of the Context RoI-OIs. w and h are the width and height of B, respectively. N and M are the numbers of pixels in T and B, respectively. M_T is the average value of the top K largest pixels in T, and m_T is the average value of all pixels in T (likewise for M_B and m_B in B). ε is a very small number close to 0, and the final output W is the contrast feature.
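The RIL-based contrast in (9) can be sketched in NumPy as follows (an illustrative implementation under our assumptions: pixel intensities are sorted in descending order for the top-K average, the Basic RoI is assumed to sit centered inside the Context RoI, and the small constants are placeholders):

```python
import numpy as np


def ril(patch, K):
    """Regional intensity level: mean of the top-K pixels over the mean of all pixels."""
    values = np.sort(patch.ravel())[::-1]
    return values[:K].mean() / (values.mean() + 1e-8)


def contrast_feature(basic_roi, context_roi, k=10, a=1024.0, eps=1e-6):
    """W = RIL_T^2 / (RIL_B + eps), with T the Basic RoI-OI and B the surrounding ring."""
    h_b, w_b = context_roi.shape
    K = max(1, int(round(k * w_b * h_b / a)))                 # Eq. (9): K = k * (w*h) / a
    # background ring: the context patch with the (assumed centered) basic patch blanked out
    mask = np.ones_like(context_roi, dtype=bool)
    ht, wt = basic_roi.shape
    y0, x0 = (h_b - ht) // 2, (w_b - wt) // 2
    mask[y0:y0 + ht, x0:x0 + wt] = False
    background = context_roi[mask]
    return ril(basic_roi, K) ** 2 / (ril(background, K) + eps)
```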

TABLE II Geometric Features
TABLE III Texture Features Based on GLCM
Fig. 14. Region division strategy for computing HOG features and definition of the surrounding background area B.

Fig. 15. Comparison of contrast features of ships and false alarms: (a) MPCM and (b) proposed contrast feature. Green boxes are ships, and red boxes are false alarms. Contrast features greater than the threshold are regarded as ships and marked in green, while features less than the threshold are considered nonships and marked in red. Thresholds for MPCM and the proposed contrast feature are set to 6.0 and 200.0, respectively.

Fig. 15(b) displays some calculation results of the proposed contrast feature with k = 10. The contrast feature of false alarms, such as port buildings, clouds, and small islands, is basically kept below 200, while both trailing and nontrailing ships achieve high contrast values, indicating the effectiveness of the proposed contrast feature. In actual experiments, we set k = 6, 10, and 16 to obtain 3-D contrast features.

To sum up, we obtain 3-D geometric features, 9 × 6 = 54-D HOG features, 4 × 6 = 24-D GLCM-based features, and 3-D contrast features. The total 84-D visual features are concatenated with the deep features after normalization to conduct the final prediction. As a small trick, to avoid large value-range differences between features, we multiply the contrast feature and the number of connected areas by 0.002 and 0.2, respectively.
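For concreteness, the texture-feature portion of this vector can be computed with scikit-image roughly as follows (an illustrative sketch, not the authors' code; the GLCM settings of 16 gray levels and distance 1 follow Section V-A, while resizing each region to a single HOG cell and the quantization fallback are our assumptions):

```python
import numpy as np
from skimage.feature import hog, graycomatrix, graycoprops
from skimage.transform import resize


def glcm_features(patch):
    """6 GLCM descriptors x 4 directions = 24-D (distance 1, 16 gray levels)."""
    if patch.max() > 0:
        q = np.uint8(np.clip(patch / patch.max() * 15, 0, 15))   # quantize to 16 levels
    else:
        q = np.zeros_like(patch, dtype=np.uint8)
    glcm = graycomatrix(q, distances=[1], angles=[0, np.pi/4, np.pi/2, 3*np.pi/4],
                        levels=16, normed=True)
    props = ["dissimilarity", "homogeneity", "contrast", "energy", "correlation", "ASM"]
    return np.concatenate([graycoprops(glcm, p).ravel() for p in props])


def hog9(patch):
    """9-bin HOG over one region, resized to a single 16x16 cell."""
    return hog(resize(patch, (16, 16)), orientations=9,
               pixels_per_cell=(16, 16), cells_per_block=(1, 1))
```

Calling hog9 on the Basic RoI-OI, the Context RoI-OI, and the four corner blocks of the Context RoI-OI and concatenating the six 9-D vectors yields the 54-D HOG descriptor; glcm_features supplies the 24-D GLCM part.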

D. Loss Function

A multitask learning strategy is adopted to realize end-to-end training. Compared with common two-stage detection networks, there are two additional supervised learning tasks in KCPNet: the mask generation task in CA-Net and the visual feature regression task in the knowledge-driven prediction head. With the two new tasks, the loss function of KCPNet is

L = (λ1/N)·Σ_{n=1}^{N} t_n·Σ_{v∈{x,y,w,h}} L_reg(u_nv, u*_nv)
  + (λ2/N)·Σ_{n=1}^{N} L_cls(p_n, t_n)
  + (λ3/(h×w))·Σ_{i=1}^{h} Σ_{j=1}^{w} [L_att(m[1]_ij, m*[1]_ij) + L_att(m[2]_ij, m*[2]_ij)]
  + (λ4/N)·Σ_{n=1}^{N} Σ_{j∈F} L_reg(f_nj, f*_nj)    (10)

where N is the total number of proposals generated by the RPN. t_n and t*_n represent the ground-truth label and the predicted label, respectively; t_n is a binary value (t_n = 1 means foreground, and t_n = 0 means background). p_n denotes the probability distribution of the ship label calculated by the softmax function. u_nv and u*_nv are the regression vectors of the ground-truth bounding box and the predicted bounding box, respectively. m[1]_ij and m*[1]_ij represent the ground-truth label and predicted value of mask M1 at pixel (i, j), respectively, and m[2]_ij and m*[2]_ij are defined likewise for mask M2. F denotes the set of visual features; f_nj and f*_nj are the visual features calculated from RoI-OIs and the predicted visual features generated from RoI-FMs, respectively. The hyperparameters λ_i (i = 1, ..., 4) control the tradeoff between the four training tasks. L_cls and L_att are the softmax cross-entropy loss, and L_reg is the smooth L1 loss.
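A minimal PyTorch sketch of this multitask loss, for illustration only. The tensor shapes are assumptions (box terms of shape (N, 4), class logits (N, 2) with integer targets, mask logits (B, 2, H, W) with integer pixel targets, and 84-D visual features per proposal), fg is the binary foreground indicator t_n, and the classification term is computed over all proposals following the standard Faster R-CNN convention:

```python
import torch
import torch.nn.functional as F


def kcpnet_loss(box_pred, box_gt, cls_logits, cls_gt, fg,        # proposal-level terms
                m1_logits, m1_gt, m2_logits, m2_gt,              # CA-Net mask terms
                vis_pred, vis_gt,                                 # 84-D visual features per proposal
                lambdas=(1.0, 1.0, 1.0, 0.1)):                    # values reported in Sec. V-A
    n = cls_logits.shape[0]
    # bounding-box regression, counted only for foreground proposals
    l_reg = (fg.unsqueeze(1) * F.smooth_l1_loss(box_pred, box_gt, reduction="none")).sum() / n
    # classification (softmax cross entropy)
    l_cls = F.cross_entropy(cls_logits, cls_gt)
    # pixel-wise mask supervision for M1 and M2 (averaged over all pixels)
    l_att = F.cross_entropy(m1_logits, m1_gt) + F.cross_entropy(m2_logits, m2_gt)
    # visual-feature regression (smooth L1 over the 84 dimensions)
    l_vis = F.smooth_l1_loss(vis_pred, vis_gt, reduction="none").sum(dim=1).mean()
    return lambdas[0] * l_reg + lambdas[1] * l_cls + lambdas[2] * l_att + lambdas[3] * l_vis
```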

SECTION V.

Experiment

In this section, we first introduce the experimental environment and hyperparameter settings in detail. Then, we conduct groups of ablation experiments to explore the effectiveness and superiority of each network component. Finally, we compare the performance of KCPNet with other state-of-the-art networks on ISDD. The proposed KCPNet achieves the best performance on multiple evaluation metrics.

A. Experimental Settings

1) Experimental Environment:

To be fair, all experiments in this section are conducted on the same server, which runs the Ubuntu 18.04 operating system with an Intel Core i9-10940X CPU at 3.30 GHz and an RTX TITAN GPU with 24-GB memory. We implement the proposed KCPNet on the TensorFlow framework.

2) Training Strategy and Network Settings:

a) Data augment:

Because of the lack of public datasets for ship detection in infrared remote sensing images, we adopt our ISDD as the main test dataset for experiments. ISDD contains 1284 images. To reduce the risk of overfitting under few-shot learning, we apply several online data augmentation strategies, including horizontal flip, vertical flip, and random rotation by 90°, 180°, and 270°.

b) Training strategy:

We train KCPNet for 60k iterations in total, with an initial learning rate of 0.001 and a weight decay of 0.0001. The learning rate is decayed by factors of 10 and 100 at 20k and 50k iterations, respectively. For the weights of the different training tasks in loss function (10), λi (i = 1, 2, 3) is set to 1, and λ4 is set to 0.1. The low λ4 is chosen to avoid the dominance of the excessive smooth-L1 loss of the 84-D visual features in the multitask training process. In the visual feature regression task, we upscale all features by a factor of 5 to make the predictions more accurate. Since the regression task generally performs well while the classification task faces greater challenges, λ2 is doubled after 30k iterations to make the network focus more on the classification task in the later stage of training.

c) Network settings:

ResNet101 is selected as the backbone in the experiments. We initialize the backbone parameters with ResNet101 pretrained on ImageNet and freeze the parameters in the C1 block to accelerate training convergence. The hyperparameters α1 and α2 in (8) of CA-Net are both set to 0.5. In the calculation of GLCM, the gray level is divided into 16 bins, and the distance parameter is set to 1. In the RPN, we assign seven aspect ratios {1:1, 1:2, 2:1, 1:3, 3:1, 1:4, 4:1} to anchors of four scales {32, 64, 128, 256}. The NMS threshold is set to 0.7. In the training process, the numbers of top-scoring boxes kept before and after applying NMS are 12 000 and 2000, respectively; in the testing process, we obtain 600 proposals after applying NMS to the 6000 top-scoring boxes. Other hyperparameters, such as the thresholds for positive and negative samples, are consistent with [13]. The NMS threshold for postprocessing is 0.5.
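As an illustration, the 4 scales × 7 aspect ratios anchor set (28 anchors per location) could be generated as below; the (x1, y1, x2, y2) layout and the convention that a ratio r means w:h = r are assumptions:

```python
import numpy as np

def generate_base_anchors(scales=(32, 64, 128, 256),
                          ratios=(1/1, 1/2, 2/1, 1/3, 3/1, 1/4, 4/1)):
    """Sketch of base anchors centered at the origin."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)        # keep the anchor area close to s * s
            h = s / np.sqrt(r)
            anchors.append([-w / 2.0, -h / 2.0, w / 2.0, h / 2.0])
    return np.array(anchors, dtype=np.float32)   # shape (28, 4)
```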

3) Evaluation Metrics:

We adopt AP@50, AP@50:95, APS@50, APM@50, and APL@50 from the widely used COCO metrics as our evaluation metrics. Moreover, since in practical applications a certain confidence threshold must be selected to obtain the final detection results, we also evaluate Precision, Recall, and F1-Score with a confidence threshold of 0.5 and an IoU threshold of 0.5. It is worth noting that no ship in ISDD qualifies as a large target under the COCO standard. Thus, we redefine the size division: ships with areas in the ranges [0, 16²], (16², 23²], and (23², 10⁵] are small, medium, and large targets, respectively. Ships of the three scales account for 34.8%, 39.1%, and 26.1% of all ships.
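The redefined size division and the F1-Score used here reduce to the following simple sketch:

```python
def size_bucket(area):
    """Assign a ship to a size bucket by its pixel area (ISDD-specific thresholds)."""
    if area <= 16 ** 2:
        return "small"    # ~34.8% of ships
    if area <= 23 ** 2:
        return "medium"   # ~39.1% of ships
    return "large"        # ~26.1% of ships

def f1_score(precision, recall):
    """F1-Score at a confidence threshold of 0.5 and an IoU threshold of 0.5."""
    return 2.0 * precision * recall / (precision + recall + 1e-12)
```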

B. Ablation Studies

In this section, we discuss the effectiveness and superiority of each component in the proposed KCPNet and explore the influence of some detailed network settings. Faster R-CNN with a ResNet101 backbone is chosen as the baseline for comparison. Hyperparameters are kept consistent across all ablation experiments. The results of the ablation studies are listed in Table IV. Fig. 16 displays examples of the multiscene performance comparison.

TABLE IV Ablation Studies of Each Component in KCPNet
Fig. 16.

Performance of four networks in ablation studies in five typical scenes of ISDD. The green, yellow, and red bounding boxes represent correctly detected ships, missing ships, and false alarms: (a) baseline (Net1); (b) baseline + BFF-Net (Net2); (c) baseline + BFF-Net + CA-Net (Net5); and (d) proposed KCPNet (Net8). The comparison between (a) and (b) reflects that BFF-Net can effectively improve the recall of small ships. The results of (c) and (d) demonstrate CA-Net and knowledge-driven prediction head further boost the ability to remove false alarms under complex scenes.

1) Effect of BFF-Net:

BFF-Net fuses features from all backbone layers into a resolution appropriate for small targets with insignificant scale variances. The RFEM in BFF-Net obtains nonlocal information with balanced receptive fields to compensate for the weak semantic features of infrared ships. As shown in Table IV, the baseline (Net1) achieves only an AP@50 of 84.53%, Recall of 82.59%, and APS of 80.31%. After introducing BFF-Net (Net2), AP@50 increases to 85.65%, and Recall and APS rise by 5.00% and 2.01%, respectively. Comparing Fig. 16(a) and (b), BFF-Net effectively improves the recall of small targets. Unfortunately, the increase in Recall sacrifices Precision, which may be attributed to the increase of false alarms on the larger feature map. Comparing KCPNet without and with BFF-Net (Net7 and Net8), there are significant improvements on all metrics after introducing BFF-Net, especially APS (+3.81%), APM (+2.63%), and AP@50 (+2.59%).

To study the superiority of the internal components of BFF-Net, we replace the proposed RFEM with other popular RFEMs, including ASPP [43], efficient spatial pyramid (ESP) [46], and C5 [44]. FPN is also implemented to compare the performance of other network necks. The experimental results are listed in Table V. Without RFEM, BFF-Net still improves APS@50 by 3.16%, but APM@50 and APL@50 decline slightly, which may be caused by the stride decrease of the fused layer. The introduction of RFEM not only makes up for this shortcoming but also effectively improves performance in all aspects. The proposed RFEM clearly achieves the best performance among the popular RFEMs, so BFF-Net is recommended to be used together with it. Compared with FPN, BFF-Net performs better on all metrics except APL, reflecting that BFF-Net is more suitable for small targets with slight scale variance.

TABLE V Comparison of Different Network Necks and RFEMs

2) Effect of CA-Net:

Considering the key role of contextual information for small targets, CA-Net is proposed to highlight information of targets and their surrounding areas and to suppress clutter in complex scenes. As shown in Table IV, CA-Net boosts AP@50 from 84.53% (Net1) to 87.74% (Net3) and significantly improves the average precision (AP) at all scales. As shown in Fig. 16(b) and (c), CA-Net significantly reduces land false alarms in complex scenes. Compared with Net6, Net8 with CA-Net achieves a 7.84% increase in Precision, demonstrating that CA-Net effectively suppresses severe background clutter and further reduces false alarms in complex scenes. The cooperation of CA-Net and BFF-Net (Net5) raises AP@50 by 4.54% (from 84.53% to 89.07%) and AP@50:95 by 2.47% (from 39.65% to 42.12%).

Moreover, we conduct a group of comparison experiments with different hyperparameters to figure out the individual influence of components in CA-Net, such as Mask1, Mask2, and the shortcut. The results are shown in Table VI. There is a sharp performance decrease (−2.27% in AP@50) for CA-Net without the shortcut; the main reason is that the shortcut prevents feature degradation in background regions. CA-Net with only M1 (α1 = 1, α2 = 0) obtains an AP@50 of 90.10%, and CA-Net with only M2 (α1 = 0, α2 = 1) achieves an AP@50 of 90.89%, which is very close to the best performance of CA-Net, indicating that M2 contributes more to the whole network than M1. The APS of CA-Net with only M2 is even better than that of the full CA-Net, which shows that enhancing contextual information improves the network's ability to perceive small targets. Therefore, in the next two experiments, we set the weight of M2 higher than that of M1 expecting better performance, but the results on most evaluation metrics are not as good as those of CA-Net with only M1. This fact indicates that M1 also plays an important role in CA-Net, and it is reasonable to balance M1 and M2. Finally, we select α1 = α2 = 0.5 as the mask weights. The influence of the hyperparameter β, which controls the scale of the contextual regions, is also explored; experiments show that β = 2 is the most suitable value.
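To make the role of the mask weights and the shortcut concrete, a hypothetical sketch is given below; it only approximates the combination in (8), whose exact form is defined earlier in the paper:

```python
def ca_net_fuse(P, M1, M2, alpha1=0.5, alpha2=0.5):
    """Hypothetical fusion: weight the two masks into a pixel-level attention map
    and apply it to the feature map P with a shortcut, roughly
    P' = P * (alpha1 * M1 + alpha2 * M2) + P."""
    M = alpha1 * M1 + alpha2 * M2   # target mask M1 and contextual mask M2
    return P * M + P                # the shortcut prevents background feature degradation
```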

TABLE VI Influence of Shortcut Structure and Hyperparameters in CA-Net

With the hyperparameter settings above, some visualization results of CA-Net are displayed in Fig. 17. It can be observed that the mask M generated by CA-Net matches the ground truth well. Background clutter in complex scenes is effectively suppressed, and the information of ships and their contextual areas is significantly enhanced.

Fig. 17.

Visualization results of CA-Net: (a) original input image; (b) input feature map $P$ of CA-Net; (c) attention mask $M$ obtained by CA-Net; and (d) output feature map $P'$ of CA-Net.

Furthermore, we also examine what happens if we add a mask M3 that is larger than M2. Let α3 and β denote the weight and scale of M3, respectively. Table VI shows that β = 2 performs better than β = 3, indicating that masks should not be too large; therefore, in this experiment, we assign different mask weights with β = 3. The results are shown in Table VII. Except that APL under α1 = α2 = α3 = 0.5 is higher than that of the proposed CA-Net, the overall performance after introducing M3 declines slightly. Thus, two masks are sufficient for the infrared ship detection task.

TABLE VII Effect of Introducing a Larger Mask in CA-Net

3) Effect of Knowledge-Driven:

Inspired by human cognitive processes, the knowledge-driven prediction head introduces visual features as prior knowledge, breaking the limitation of the black-box property of CNNs to a certain extent. As shown in Table IV, the knowledge-driven prediction head improves AP@50 from 84.53% (Net1) to 86.86% (Net4). In addition, the Precision of Net5 rises by 4.28% after introducing the knowledge-driven prediction head. Fig. 16(d) demonstrates that the knowledge-driven prediction head improves the ability of the network to identify false alarms. It is worth noting that networks with only one or two subnetworks fail to achieve an APS higher than 85%, whereas the proposed KCPNet with all three components boosts APS to 87.31%, indicating that the interaction of the three subnetworks maximizes the detection performance on small targets.

A reasonable design of visual features is the key to the knowledge-driven prediction head. We evaluate the network performance under 11 different combinations of visual features and three normalization strategies. In Table VIII, G1–G11 display the influence of visual feature design. The comparison between G1 and G3 reflects that increasing the variety of features does not lead to higher performance; on the contrary, appropriate feature selection matters more for optimal performance. G4–G6 test three combinations of texture features: GLCM obtains higher APS and AP@50, while GLCM + HOG achieves higher AP@50:95. To better interact with the HOG features from Contextual RoIs, we choose GLCM + HOG as the texture features in KCPNet. G7 and G8 show that dividing the Contextual RoI into four subblocks to compute additional HOG features can further improve performance. G9–G11 verify the effectiveness of the proposed contrast feature.
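A simplified, HOG-like sketch of the four-subblock texture descriptor on a Contextual RoI is shown below; the 2×2 split, the 9 orientation bins, and the per-histogram normalization are assumptions, and the released code should be consulted for the exact HOG configuration:

```python
import numpy as np

def subblock_hog(roi, bins=9):
    """Split a grayscale contextual RoI into 2x2 subblocks and compute a
    magnitude-weighted orientation histogram per subblock (36-D in total)."""
    gy, gx = np.gradient(roi.astype(np.float32))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0          # unsigned gradient orientation
    h, w = roi.shape
    feats = []
    for ys in (slice(0, h // 2), slice(h // 2, h)):
        for xs in (slice(0, w // 2), slice(w // 2, w)):
            hist, _ = np.histogram(ang[ys, xs], bins=bins, range=(0.0, 180.0),
                                   weights=mag[ys, xs])
            feats.append(hist / (hist.sum() + 1e-6))      # normalize each histogram
    return np.concatenate(feats)                          # 4 x 9 = 36-D
```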

TABLE VIII Comparison of Different Visual Feature Designs in Knowledge-Driven Prediction Head

G11–G13 in Table VIII explore whether and where to apply normalization. Without normalization, G12 achieves only an AP@50 of 89.96% (−1.03%) and an APS of 85.52% (−1.79%), demonstrating the key role of normalization. G13 conducts normalization before prediction, which means using normalized features as training labels. The performance of G13 is slightly inferior to that of G11, indicating that the location of normalization also has a certain impact on the network.

G14 tests the performance of directly combining the visual features from RoI-OIs with the deep features, without supervised learning. We find that, although the predicted features in G11 do not reproduce every visual feature precisely, the overall performance of G11 is still slightly superior to that of G14. This observation demonstrates that back-propagating the knowledge throughout the network is more crucial than simply fusing accurate visual features.

C. Comparison With the State-of-the-Art

To impartially evaluate the performance of KCPNet, we implement nine state-of-the-art detection networks on ISDD for comparison, including Faster RCNN [13], SSD [15], CenterNet [16], RetinaNet [25], YOLO [56], FCOS [57], EfficientDet [58], Cascade RCNN [59], and Sparse RCNN [60]. For all experiments, hyperparameters such as the NMS threshold and the RPN parameters in two-stage networks are kept consistent with KCPNet. For fairness, except for YOLO, EfficientDet, and CenterNet, which use their own backbones, we implement all of the above networks with ResNet101. We adjust training settings such as anchor settings and training epochs for several networks so that they achieve their best performance on the ship detection task. We implement RetinaNet, YOLO, CenterNet, and Sparse RCNN based on PyTorch because the accuracy of their TensorFlow versions is not satisfactory. It should be clarified that different deep learning frameworks (TensorFlow/PyTorch) may affect the speed and memory comparison to some extent. Table IX displays the performance comparison. It can be observed that the proposed KCPNet achieves the best performance. The speed of KCPNet is comparable to that of the other two-stage networks, and its model size increases by only 8.8M over the baseline Faster RCNN. Fig. 18 plots the P–R curves under the COCO evaluation metrics for further comparison; the P–R curve of KCPNet occupies the rightmost and uppermost position.

TABLE IX Performance Comparison of the State-of-the-Art Networks on ISDD
Fig. 18.

P–R curves of state-of-the-art detection networks on ISDD with an IoU threshold of 0.5.

D. Discussion

To explore how the introduced knowledge works in the network, we carefully study the learning process of the visual features. We find that the knowledge-driven prediction head does not learn every visual feature precisely. In addition, as shown in Fig. 19, the regression loss of the visual features oscillates to some extent, which may affect the learning of the other tasks. The main reason is that the smooth-L1 loss may not meet the requirement of high-precision regression of high-dimensional features. Thus, in future work, we will design loss functions that are better suited to the visual feature regression task. Another possible solution is to quantize the visual features into discrete levels, turning the regression task into a classification task, which may converge more stably during training.
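A minimal sketch of this quantization idea is given below; the number of levels and the value range of the normalized features are assumptions:

```python
import numpy as np

def quantize_features(feats, num_levels=8, low=0.0, high=1.0):
    """Map each normalized visual feature to one of `num_levels` discrete classes,
    turning the regression target into per-feature classification labels."""
    edges = np.linspace(low, high, num_levels + 1)[1:-1]  # interior bin edges
    return np.digitize(feats, edges)                      # integer labels in [0, num_levels - 1]
```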

Fig. 19.

Loss curve of the visual feature regression task.

Moreover, the proposed KCPNet passes the knowledge of visual features to every layer of the network through back-propagation. However, the knowledge is not as effective in low-layer features as in high-layer features. Therefore, we consider truncating the back-propagation of the visual feature regression at suitable layers to optimize the learning of visual features and improve the efficiency of back-propagation.
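Such truncation could be prototyped with a stop-gradient operation, as in the hypothetical sketch below (the layer widths and the two-layer head are assumptions):

```python
import tensorflow as tf

def visual_feature_head(roi_features, truncate_gradient=True):
    """Predict the 84-D visual features; optionally stop the gradient so the
    regression loss does not back-propagate into earlier layers."""
    x = tf.stop_gradient(roi_features) if truncate_gradient else roi_features
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    return tf.keras.layers.Dense(84)(x)   # predicted 84-D visual feature vector
```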

SECTION VI.

Conclusion

In this article, we proposed KCPNet for infrared ship detection under complex ocean environments. BFF-Net was designed to improve the recall and precision for ships of all scales, especially small ships, by balancing semantic and location information and generating nonlocal features with balanced receptive fields. Then, to cope with severe clutter in complex scenes and the weak semantic information of infrared ships, we proposed a pixel-level attention network, CA-Net, to highlight the targets and their contextual information while suppressing background clutter. Moreover, we modeled handcrafted visual features in the prediction head to propagate prior knowledge throughout the whole network, which reduces false alarms and further boosts detection performance. In addition, to the best of our knowledge, the ISDD published in this article is the first public benchmark for infrared ship detection. Extensive experiments demonstrate that KCPNet achieves state-of-the-art performance on ISDD. In future work, we will explore more efficient models for disseminating knowledge in data-driven approaches, and ISDD will be kept updated with more images and scenes that are closer to practical applications.

ACKNOWLEDGMENT

The authors would like to thank the editor and the anonymous reviewers, whose constructive comments helped to improve the quality of this article. The authors also appreciate the published code of Yang's models, Zhou's model, and Luo's model used for comparison.
