Introduction
In recent decades, increasing attention has been paid to marine resource development, space utilization, and environmental protection. As the dominant carrier of marine transportation, ships are key targets in marine inspection. Ship detection in remote sensing images is of great strategic significance in civil and military fields such as marine traffic management, marine rescue, smuggling supervision, and national security. Unfortunately, in practical applications, the ocean environment can be highly complex and contains various sources of false alarms, including buildings on land, thick clouds, and severe weather such as strong winds, heavy waves, and dense fog. Therefore, there is an urgent need to develop accurate and robust ship detection techniques for complex ocean environments.
As shown in Fig. 1, there are three main data sources for ship detection tasks: visible images, synthetic aperture radar (SAR) images, and infrared images. Visible images provide rich semantic information but cannot work at night or in extreme weather. Fortunately, SAR images can work around the clock. Thus, SAR images are an important complement for ship detection. Compared with visible and SAR images, infrared imaging detection in certain sensitive bands has better capabilities of detection, positioning, and identification, which are crucial in military applications. In addition, infrared images play an important role in all-weather and long-distance detection technology, especially in space-based detection applications. Therefore, infrared images are an indispensable data source for the ship detection task. However, current ship detection research [1], [2], [3], [4] is mainly based on visible and SAR images. Few studies focus on infrared images, let alone pursue high detection performance. One reason for this research blind spot is the lack of public infrared datasets due to the confidentiality and inaccessibility of infrared images. Therefore, it is necessary to construct a new benchmark and study high-performance, robust infrared ship detection.
Three main data sources for the practical ship detection tasks: (a) visible images; (b) SAR images; and (c) infrared images.
In contrast to object detection in natural scenes, the infrared ship detection task has many challenges in the following respects.
Low-resolution and single-channel information of infrared images.
Complex scenes and weather. Grayscale values of the ocean surface and land present bipolarity [5] due to the temperature difference between day and night (shown in Fig. 2). The contrast between ships and the background varies because of thin clouds and temperature changes. Moreover, thick clouds and strip-shaped buildings on land are prone to become false alarms.
Inherent characteristics of ship targets. Ships in infrared remote sensing images are very small and lack semantic features. Ships parked near the coast are easily submerged by the land.
Bipolarity of grayscale values: (a) during the day, the land is bright and the sea is dark and (b) at night, the land is dark and the sea is bright.
Current detection approaches can be preliminarily divided into two categories: conventional detection methods based on visual feature modeling and deep learning-based algorithms. The former are model-driven approaches, which rely on expert prior knowledge to design handcrafted features [6], [7]. Over the decades, model-driven approaches have achieved great breakthroughs in infrared small target detection [8], [9], [10], [11]. However, their robustness under complex scenes is unsatisfactory because the recognition accuracy is limited by visual feature representation and manual parameter tuning. The latter are data-driven algorithms based on convolutional neural networks (CNNs) [12], which offer high efficiency and stability. Many excellent networks, such as Faster RCNN [13], YOLO [14], SSD [15], and CenterNet [16], perform well in natural detection scenes. However, due to the black-box properties of CNNs, it is not easy to make targeted improvements in specific tasks. For infrared ship detection, previous networks show various limitations: 1) many false alarms cannot be identified with deep features alone and 2) after multiple pooling operations, the information of small ships is greatly lost or even submerged in high-layer feature maps.
To address the above challenges, we propose a knowledge-driven context perception network (KCPNet), which is an end-to-end two-stage network. The design of KCPNet is inspired by our previous work [34] on ship detection in optical images of the visible light band, further addressing the missed detection of small, semantically weak ships and the interference of false alarms in low-quality infrared imagery. KCPNet combines a novel feature fusion network with receptive field expansion modules to improve the recall of small ships, proposes an efficient attention mechanism for complex scenes, and introduces well-designed visual features as prior knowledge to drive the prediction head for false alarm removal. KCPNet achieves state-of-the-art performance on the first public infrared ship detection benchmark, called the infrared ship detection dataset (ISDD), published in this article. The main contributions of this study are as follows.
Aiming at the small scale of ships, a balanced feature fusion network (BFF-Net) is proposed to ensure the efficient information transmission of small targets and expand receptive fields by constructing balanced local and nonlocal features.
Considering the thin semantic information of infrared targets and false alarms in complex scenes, we design a pixel-level contextual attention network (CA-Net) to strengthen target and contextual information simultaneously and suppress severe clutter in complex scenes.
To further reduce false alarms, a novel knowledge-driven prediction head integrates the well-designed visual features as prior knowledge through a supervised regression branch. The prior knowledge can be back-propagated to the entire network, and the learned features with strong physical interpretability can jointly guide the final prediction.
To advance research on ship detection in infrared images, we construct a new dataset called ISDD, which contains 1284 infrared images with 3061 ship instances. As far as we know, ISDD is the first public benchmark for infrared ship detection.
The rest of this article is organized as follows. Section II briefly introduces the model-driven ship detection schemes and the deep learning-based ship detection networks. A detailed description of the proposed dataset ISDD is given in Section III. Section IV contains details of the proposed framework design. Section V analyzes the experimental results based on the ablation study and comparison with the state-of-the-art approaches. Section VI presents the conclusions.
Related Work
A. Model-Driven Schemes
Model-driven schemes can be further divided into two categories: feature modeling-based methods and visual saliency-based methods. Both require segmenting the marine area to eliminate false alarms in land areas and reduce computational complexity.
Feature modeling-based methods follow the detection pattern of “candidate region proposal + physical feature extraction + classifier.” How to select or design the visual feature is decisive to the detection performance. The commonly used visual features include geometric features, edge features, contrast features, and texture features. Xia et al. [17] introduced a dynamic model to fuse the geometric features of images and employed support vector machine (SVM) to detect ships. Krizhevsky et al. [18] took means, variances, and wavelet changes as feature descriptors. Aiming at the high aspect ratio (AR) rectangular shape of ships, Qi et al. [19] designed a novel descriptor called ship histogram of oriented gradient (S-HOG) to characterize the gradient symmetry of ship sides.
Visual saliency-based methods simulate the human brain to focus on the salient regions quickly and promptly locate and perceive the targets in complex scenarios. There are two categories of visual saliency-based models: spatial saliency models and frequency saliency models. Itti et al. [20] proposed a ground-breaking spatial saliency model, which constructs the final saliency map based on intensity, color, and orientation. Harel et al. [21] defined a Markov chain-based saliency model and treated the balanced distribution over map locations as saliency values. Frequency saliency is defined in the Fourier spectrum or Gabor spectrum. Guo et al. [22] adopted the spectral residual [23] to obtain the initial target curves. Based on the entropy information, Xu et al. [24] modeled a combined saliency structure with self-adaptive weights.
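For illustration, the frequency-domain prior behind these saliency models can be sketched compactly. Below is a minimal implementation of the classic spectral residual saliency map [23], the cue adopted by Guo et al. [22]; the filter sizes and smoothing sigma are illustrative assumptions rather than the cited settings.

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual_saliency(gray: np.ndarray) -> np.ndarray:
    """gray: 2-D float image; returns a saliency map of the same shape."""
    spectrum = np.fft.fft2(gray)
    log_amplitude = np.log(np.abs(spectrum) + 1e-8)
    phase = np.angle(spectrum)
    # Spectral residual: log amplitude minus its locally averaged version.
    residual = log_amplitude - uniform_filter(log_amplitude, size=3)
    # Reconstruct in the spatial domain and smooth to get the saliency map.
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return gaussian_filter(saliency, sigma=2.5)
```

Thresholding such a map yields the initial target regions on which the cited detectors build.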
Each handcrafted feature in model-driven schemes has strong physical meaning and rich local or global semantic information. Unfortunately, the detection performance relies on parameter tuning and lacks robustness in complex marine scenarios. Thus, model-driven ship detection methods reach satisfactory performance only in high-quality images with simple scenes.
B. Deep Learning-Based Schemes
In recent years, the boom in computing power has prompted the rapid development of data-driven algorithms. As a typical representative, deep learning-based approaches have achieved excellent performance in the computer vision field. According to whether candidate regions are proposed, object detection networks can be classified into two categories: single-stage networks and two-stage networks. Single-stage networks represented by YOLO [14], SSD [15], and RetinaNet [25] have low computational complexity and high speed due to the omission of the region proposal step. Two-stage networks such as Faster RCNN [13] tend to have higher accuracy and are less troubled by the imbalance between positive and negative samples.
To tackle the challenges in the ship detection task, many researchers have made effective improvements based on the above networks. For higher accuracy and robustness in large-scale infrared images, Zhou et al. [26] proposed a simple one-stage ship detection network to learn joint features from multiresolution infrared images. As one of the few high-quality infrared ship detection studies, Wang et al. [5] employed spectrum characteristics for coarse ship detection and designed a light CNN for fine detection, thus achieving competitive detection results under limited computation and small storage space. Many ship detection networks based on visible light or SAR images are also inspiring. Multilayer feature fusion structures have attracted the attention of many researchers seeking to improve the recall of small ships. For example, Jiao et al. [27] proposed a dense feature fusion network, finding that the retention of low-level features can greatly increase the recall of small ships. Li et al. [28] introduced an attention mechanism into the feature fusion process, further enhancing the efficiency of information passing. In addition, attention mechanisms have also been adopted to reduce false alarms in complex scenes. Fu et al. [4] proposed an anchor-free network with an attention-guided balanced pyramid (ABP) to extract salient features in complex scenes adaptively and efficiently. Cui et al. [29] introduced a spatial shuffle-group enhance (SSE) attention module into CenterNet to achieve higher accuracy on ship detection in large-scale SAR images. Moreover, some studies adopted contextual information to enhance the networks' ability to identify false alarms. For example, Kang et al. [30] concatenated region of interest (RoI) features and contextual features around RoIs in the prediction head to reduce false alarms. Li et al. [31] improved YOLOv5 with dilated convolution modules to enlarge receptive fields for better robustness in complex scenes.
Most of the above ship detection studies are conducted on visible or SAR images, while research on infrared images is quite rare. The main obstacle is the lack of datasets. Moreover, due to the poor physical interpretability of neural networks, it is difficult to make targeted improvements for particular problems. Thus, current detection networks do not well address the specific challenges of ship detection in infrared images. This motivates a new research idea: introduce the knowledge in model-driven approaches into data-driven networks to combine their strengths and complement their weaknesses. There have been some novel attempts to combine data-driven and model-driven approaches in other fields. For example, Wang et al. [32] detected hard exudates based on multifeature joint representation for diabetic retinopathy screening. Arvor et al. [33] proposed a knowledge-driven scheme for land cover mapping over large areas. Wang et al. [34] designed a model-data-knowledge-driven and deep learning (MDK-DL) method for land surface temperature retrieval. However, these approaches employ model-driven and data-driven modules in separate stages of the entire framework, which are not end-to-end. This article explores efficient solutions for combining model-driven knowledge and data-driven networks in an end-to-end way for the infrared ship detection task.
Infrared Ship Detection Dataset
A. Motivation
Datasets play a critical role in data-driven models. However, as far as we know, there is no public dataset for ship detection in infrared images. Previous infrared ship detection studies adopt two main strategies to deal with the absence of public datasets: using small-scale private datasets or simulating infrared images based on visible remote sensing images. Most of the private datasets contain no more than 500 images. These private datasets are difficult to obtain, and their scales are not suitable for the study in this article. As for the simulated datasets, there are three main approaches to converting visible images into infrared images: generative adversarial networks (GANs) [35], variational autoencoders (VAEs) [36], and traditional image processing. No matter which simulation approach is utilized, the simulated images cannot imitate the various environmental factors of infrared imaging. To sum up, the dataset strategies adopted in previous studies cannot meet the needs of high-performance infrared ship detection research. Therefore, we develop ISDD as a benchmark for the research in this article. More importantly, we hope ISDD will be helpful to other researchers in related fields.
B. Collection and Annotations of ISDD
As mentioned above, synthetic infrared remote sensing images are far inferior to real collected ones. Our ISDD collects real infrared remote sensing images taken by the Landsat-8 satellite, which carries the Operational Land Imager (OLI) with nine imaging bands and the Thermal Infrared Sensor (TIRS) with two thermal infrared imaging bands. We fuse three OLI bands, Band 7, Band 5, and Band 4, to obtain short-wave infrared images. All images in ISDD are preprocessed by radiometric calibration and FLAASH atmospheric correction. In order to expand the diversity of the dataset, ISDD collects images taken in the United States, China, Japan, Australia, Europe, etc.
ISDD contains a total of 1284 infrared images with 3061 ship instances. The size of images is 500 × 500 pixels.
C. Properties of ISDD
The proposed ISDD has the following properties.
Small Scale of Ship Instances: As shown in Fig. 4(b), the proportion of instance area to the whole image area ranges from 0.014% to 2.39%, with an average proportion of only 0.18%. Taking the height of the bounding box as a reflection of ship size, Fig. 4(c) shows the instance size distribution comparison among three different ship detection datasets: ISDD, an optical ship recognition dataset called the dataset for oriented ship recognition (DOSR) [37], and the widely used SAR ship detection dataset (SSDD) [38]. The height of ships in ISDD ranges from 4 to 73 pixels. Instances in ISDD have an average height of 19.59 pixels, which is much smaller than 71.22 of DOSR and 58.94 of SSDD. These statistics indicate that the scale of the ships in ISDD is quite small, which brings great challenges to detection networks.
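Such instance-scale statistics can be reproduced directly from the box annotations; below is a minimal sketch, where the (x1, y1, x2, y2) box layout and the 500 × 500 image size are assumptions about the annotation format.

```python
import numpy as np

def instance_scale_stats(boxes: np.ndarray, img_area: float = 500.0 * 500.0) -> dict:
    """boxes: (N, 4) array of [x1, y1, x2, y2] ship bounding boxes in pixels."""
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    # Proportion of instance area to the whole image area, in percent.
    area_ratio = widths * heights / img_area * 100.0
    return {
        "area_ratio_%": (area_ratio.min(), area_ratio.max(), area_ratio.mean()),
        "height_px": (heights.min(), heights.max(), heights.mean()),
    }
```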
Scene Diversity: The various scenarios in ISDD are complex and close to practical applications. According to the state of the ship stern, there are two types of scenes: ships with trails and ships without trails. As shown in Fig. 5(a), most ship trailing waves are manifested as bright stripes of water streaks behind the ships, and a few trailing waves appear as sectorial water streaks behind the ships. For ships with trails, ISDD annotates only the hull part, not the trail part. In addition, ISDD contains not only a large number of offshore scenes but also a wealth of nearshore scenes with more complex backgrounds. ISDD contains 373 inshore scenes with 924 instances and 911 offshore scenes with 2137 instances. As shown in Fig. 5(b), inshore scenes contain many buildings on and near land, which may greatly increase the false alarm rate. In addition, there are also moored ships, shown in Fig. 5(c), which are easily submerged by land areas with similar grayscale and lead to missed detections. Moreover, nearshore scenes also exhibit the grayscale inversion phenomenon due to diurnal temperature fluctuation shown in Fig. 2.
Weather Condition Diversity: Images in ISDD were shot under a variety of weather conditions. A typical scenario is the windy weather shown in Fig. 5(d), where irregular ripples appear on the sea surface. Fig. 5(e) and (f) shows scenes of thin and thick clouds, respectively. Under thin clouds, the contrast between ships and the background is sharply reduced, while dotted thick clouds form severe clutter and are prone to be detected as false alarms.
Publicly Available: As the first public dataset for ship detection in infrared remote sensing images, ISDD is publicly available on GitHub: https://github.com/yaqihan-9898/ISDD.
Scene diversity and weather condition diversity in ISDD: (a) ships with trail; (b) inshore scenes; (c) berthing scene; (d) sea wave scenes; (e) thin clouds scenes; and (f) thick clouds scenes.
Methodology
The overall framework of the proposed KCPNet is illustrated in Fig. 6. KCPNet is an end-to-end network with a two-stage structure. In the first stage, balanced feature maps with densely structured receptive fields are extracted by BFF-Net. CA-Net further suppresses clutter in complex scenes and highlights target and contextual information. The enhanced features are sent to the RPN to generate proposals. In the second stage, visual features serve as prior knowledge to guide visual awareness learning. Deep features and learned visual features jointly contribute to the final prediction.
Overall framework of KCPNet. There are three subnetworks in KCPNet: 1) BFF-Net efficiently fuses features from all backbone layers and introduces nonlocal information with balanced receptive fields; 2) CA-Net simultaneously enhances targets and corresponding contextual information, and suppresses clutter in complex scenes; and 3) knowledge-driven prediction head learns visual features to conduct final prediction task jointly with deep features and back-propagates knowledge to the whole network.
A. Balanced Feature Fusion Network
The network neck, represented by FPN [39], undertakes the task of fusing and postprocessing the backbone features. We design a BFF-Net for infrared ship detection. The proposed BFF-Net balances features in three aspects. First, it balances semantic information and resolution in a single feature map layer for small targets. Second, many feature fusion networks [36], [39], [40], [41] add or concatenate different layers equally; in fact, however, different feature maps contribute differently to the whole detection task. BFF-Net balances the information ratio of different layers and different channels to enhance efficient information and reduce information redundancy. Third, a receptive field expansion module (RFEM) in BFF-Net balances the structure of different receptive fields. The structure of BFF-Net is illustrated in Fig. 7.
Structure of BFF-Net. FFM is the first component of BFF-Net, which fuses backbone features to an appropriate resolution for small targets with a balanced information structure of features from different layers. The fused features are input to the RFEM to catch nonlocal contextual information.
There are two core components in BFF-Net: a feature fusion module (FFM) and an RFEM. We utilize Block1–Block4 of ResNet as our backbone. To speed up convergence and avoid overfitting, we employ Block5 of ResNet and global average pooling (GAP) to substitute for the commonly used two FC layers in the prediction head. Thus, our backbone only generates four feature map layers.

Due to the lack of local semantics of small targets, there is an urgent need to extract nonlocal contextual information with large receptive fields as an important supplement and compensation. In order to capture balanced nonlocal contextual features and local target features, we propose an RFEM based on densely connected dilated convolutions. The structure of RFEM is depicted in Fig. 8, which can also be formulated as follows:

Structure of RFEM.
The stride of
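As a concrete illustration, the following is a minimal PyTorch sketch of a densely connected dilated-convolution module in the spirit of RFEM; the number of branches, the dilation rates, and the channel widths are illustrative assumptions rather than the exact settings of Fig. 8.

```python
import torch
import torch.nn as nn

class DenseDilatedRFEM(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList()
        for i, d in enumerate(dilations):
            # Each branch sees the input plus all previous branch outputs,
            # so receptive fields of different sizes are densely combined.
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels * (i + 1), channels, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ))
        self.fuse = nn.Conv2d(channels * (len(dilations) + 1), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for branch in self.branches:
            feats.append(branch(torch.cat(feats, dim=1)))
        # Fuse local (small dilation) and nonlocal (large dilation) context.
        return self.fuse(torch.cat(feats, dim=1))
```

The dense connections keep small-dilation (local) responses flowing into large-dilation (nonlocal) branches, which is the balance of receptive fields the module is named for.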
B. Contextual Attention Network
Ocean scenes can be very complex in practical applications, especially near the shore. Background clutter and false alarms seriously interfere with detection accuracy. To solve this problem, attention mechanisms have been proposed. Current research is mainly based on unsupervised attention networks, such as [42], [47], [48], [49], and [50]. However, unsupervised attention networks cannot learn for specific targets. In addition, false alarms closely resemble the appearance of ships, so the features of false alarms are prone to be enhanced together with real targets. Fortunately, supervised attention networks utilize ground truth to guide the learning process and are thus more advantageous for removing false alarms.
Based on the supervised dual-mask attention network proposed in [37], we design a new CA-Net to break the limitation of thin semantic features of small ships and reduce false alarms in infrared images. The design of CA-Net is based on an important observation: as shown in Fig. 9, in a set of detection result patches mixed with targets and false alarms, even humans cannot quickly identify which are targets. Once we include the surrounding areas of the detected results in the scope of observation (shown in Fig. 10), humans can easily distinguish between targets and false alarms. This observation also holds true for the network learning process, where the networks require not only the target information but also the surrounding areas of the target. Therefore, the CA-Net is proposed to strengthen target and contextual information simultaneously and suppress the background clutter.
Detection result patches mixed with ship targets and false alarms. Even humans cannot tell which are ship targets in a second. Actually, (a)–(d) are ship targets and (e)–(h) are all false alarms.
Detection result patches with surrounding areas. Compared with patches in Fig. 9, patches with surrounding areas are much easier to be identified: (a)–(d) ship targets and (e)–(h) false alarms.
CA-Net is a supervised pixel-level attention network. The framework of CA-Net is illustrated in Fig. 11. The labels of CA-Net are two binarized masks, Mask1 and Mask2, with the same resolution as the input feature map. In the two masks, foreground pixels are set to 1, and the background is filled with 0. For Mask1, areas of ground-truth bounding boxes are foreground. For Mask2, we regard areas of

Framework of CA-Net. Mask1 and Mask2 guide CA-Net to generate spatial attention mask
CA-Net no longer restricts its attention to the target itself but expands its view to the contextual characteristics of the target's surrounding environment. As shown in Fig. 11, compared with the original input feature map
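To make the supervision concrete, here is a minimal sketch of how the two binary mask labels can be built from ground-truth boxes; the context enlargement factor (2×) and the feature-map stride are illustrative assumptions, since the exact ratio is truncated above.

```python
import numpy as np

def build_ca_masks(boxes, feat_h, feat_w, stride=8, context_scale=2.0):
    """boxes: iterable of (x1, y1, x2, y2) in image coordinates;
    returns Mask1 (targets) and Mask2 (targets + context) at feature-map size."""
    mask1 = np.zeros((feat_h, feat_w), dtype=np.float32)
    mask2 = np.zeros((feat_h, feat_w), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        for mask, scale in ((mask1, 1.0), (mask2, context_scale)):
            w = (x2 - x1) * scale
            h = (y2 - y1) * scale
            xa = max(int((cx - w / 2) / stride), 0)
            ya = max(int((cy - h / 2) / stride), 0)
            xb = min(int((cx + w / 2) / stride) + 1, feat_w)
            yb = min(int((cy + h / 2) / stride) + 1, feat_h)
            mask[ya:yb, xa:xb] = 1.0  # foreground pixels are set to 1
    return mask1, mask2
```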
C. Knowledge-Driven Prediction Head
1) Structure of Prediction Head:
Due to the black-box nature of CNNs, the physical interpretability of deep features is poor. Experiments demonstrate that many false alarms cannot be discerned if only deep features are used. Considering that the learning process of a CNN simulates the human cognitive process, to solve the false alarm problem at the root, we should analyze why humans can detect ships quickly in an infrared image. One key advantage of human beings is that we have sufficient prior knowledge of ships, such as their shape and contrast, and where they might appear. These features with clear physical meaning can be modeled by visual features in model-driven methods. Therefore, we combine the model-driven approach with the data-driven approach and design a novel detection head as depicted in Fig. 12.
Workflow of the knowledge-driven prediction head. Inside the left green box, the solid and dashed lines represent RoI align and crop operations, respectively. The colors of the lines above the RoIs correspond to the line colors in the left green box, which reflect the proposal source and the operation used to get the RoIs. Visual features generated from RoI-OIs serve as training labels, and RoI-FMs are input to five 3
As we analyzed in Section IV-B, contextual information is beneficial for distinguishing ship targets from nonship targets. Therefore, in the knowledge-driven prediction head, RoI align is conducted twice to get the basic RoI from the feature map (Basic RoI-FM) and the context RoI from the feature map (Context RoI-FM). Basic RoI-FM is the same as the classic RoI in traditional two-stage networks. Note that Context RoI-FM is generated from context proposals, which have
Visual features computed from Basic RoI-OIs and Context RoI-OIs serve as training labels. Basic RoI-FMs and Context RoI-FMs are input to five 3
Intuitively, we could directly use the visual features computed on RoI-OIs in combination with deep features. However, experiments demonstrate that introducing some visual features with high computational complexity greatly slows down the network. The supervised learning mechanism used in our prediction head can uniformly calculate different visual features at low computational cost while maintaining the validity of the visual features. More importantly, the prior knowledge can be passed to all previous feature extraction processes through the back-propagation mechanism. As the bridge between the model-driven and data-driven approaches, the proposed prediction head exploits their respective strengths and diminishes their respective disadvantages. The loss function of the knowledge-driven prediction head will be further introduced in Section IV-D.
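A minimal sketch of the double RoI-align step follows, using torchvision's roi_align; the 2× context enlargement factor and the 1/8 spatial scale are assumptions for illustration, since the exact values are truncated above.

```python
import torch
from torchvision.ops import roi_align

def basic_and_context_rois(feat, rois, out_size=7, spatial_scale=1.0 / 8, ratio=2.0):
    """feat: (N, C, H, W) feature map; rois: (K, 5) as [batch_idx, x1, y1, x2, y2]."""
    idx, x1, y1, x2, y2 = rois.unbind(dim=1)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) * ratio / 2, (y2 - y1) * ratio / 2
    # Context proposals: the basic proposals enlarged about their centers.
    context = torch.stack([idx, cx - half_w, cy - half_h,
                           cx + half_w, cy + half_h], dim=1)
    basic_roi_fm = roi_align(feat, rois, out_size, spatial_scale)
    context_roi_fm = roi_align(feat, context, out_size, spatial_scale)
    return basic_roi_fm, context_roi_fm
```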
2) Design of Visual Features:
The visual features should contain sufficient prior knowledge and be capable of distinguishing ship targets from false alarms. In order to better design the visual features, we need to analyze the characteristics of false alarms. As displayed in Fig. 13, there are two main categories of false alarms: cloud false alarms and land false alarms. Cloud false alarms can be further divided into strip clouds and point clouds. Land false alarms contain strip highlighted buildings, protruding land, and small islands. Most false alarms are similar in shape to ships or have contrast with the surrounding background similar to that of ships. Fortunately, there are also many differences in detail between false alarms and ship targets, such as texture and the regularity of shape.
Detection results with false alarms. Green bounding boxes are ships that are correctly detected. Red bounding boxes are false alarms. For a clearer view of the false alarms, all false alarms are numbered and zoomed in below the image: (a) strip cloud false alarms; (b) point cloud false alarms; (c) strip highlight buildings; (d) No. 1 false alarm is highlight buildings near the port, and No. 2 false alarm is protruding land; and (e) island false alarms.
Based on the analyses above, we selected three types of visual features for the network to learn: geometric features, texture features, and contrast features. The computation of all visual features is based on the RoI-OIs. The detailed descriptions are as follows.
The geometric features include rectangle degree, AR, and the number of connected areas. All RoI-OIs should be binarized first before calculating the geometric features. See Table II for formulaic definitions of three geometric features. Geometric features perform best in removing false alarms with irregular or unclear boundaries such as clouds and islands.
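As a concrete illustration, a minimal sketch of the three geometric features on a binarized RoI-OI follows; the mean-value binarization is an illustrative assumption, and the exact formulas are those listed in Table II.

```python
import numpy as np
from scipy import ndimage

def geometric_features(patch: np.ndarray):
    """patch: 2-D grayscale RoI-OI; returns (rectangle degree, AR, #connected areas)."""
    binary = patch > patch.mean()  # simple stand-in for the paper's binarization
    ys, xs = np.nonzero(binary)
    if xs.size == 0:
        return 0.0, 0.0, 0
    h = ys.max() - ys.min() + 1
    w = xs.max() - xs.min() + 1
    rectangle_degree = binary.sum() / float(h * w)  # fill ratio of the bounding box
    aspect_ratio = max(h, w) / float(min(h, w))     # AR of the foreground extent
    _, n_areas = ndimage.label(binary)              # number of connected areas
    return rectangle_degree, aspect_ratio, n_areas
```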
The texture features contain HOG features [51] and GLCM-based features [52]. For HOG features, as illustrated in Fig. 14, we first compute nine-bin HOG features of the Basic RoI-OI and Context RoI-OI, then divide the Context RoI-OI into four corner blocks and calculate HOG features of each block separately. GLCM-based features include six types of features: dissimilarity, homogeneity, contrast, energy, correlation, and angular second moment (ASM). Table III lists the formulas of the six GLCM-based features. All GLCM-based features are calculated in four different directions, that is, $0$, $\pi/4$, $\pi/2$, and $3\pi/4$. Texture features have a good effect on false alarms with uneven grayscale distribution.

The contrast feature is designed to describe the contrast between the RoI and the surrounding background. As illustrated in Fig. 14, we define the surrounding background $B$ as the ring area after cutting off the Basic RoI-OI from the Context RoI-OI. Due to the effects of atmospheric refraction and optical defocusing, the bodies of infrared ships generally exhibit a Gauss-like morphological feature. However, because of the trail parts of ships, traditional contrast algorithms using eight-neighbor grayscale difference or ratio, such as the local contrast measure (LCM) [53] and the multiscale patch-based contrast measure (MPCM) [54], are no longer suitable for reflecting ship contrast. As shown in Fig. 15(a), MPCM can identify cloud and port false alarms, but the contrast of trailing ships is incorrectly low, and island false alarms cannot be distinguished correctly. In practice, strip buildings near the port, as shown in Fig. 13(d), are easily confused with ships with trails and detected as false alarms. It is worth noting that the grayscale of the ship's trailing area usually decreases unevenly, while the grayscale of port buildings is often evenly distributed along a specific direction. Considering the grayscale distribution characteristic and the Gauss-like morphological feature of the ship body, the regional intensity level (RIL) [55] is a good measure of grayscale complexity. We propose a local contrast algorithm for infrared ship targets based on RIL, which can effectively distinguish ships from false alarms. The formulaic expression of our contrast feature is as follows:
\begin{align*} K&=k\cdot \frac {w\cdot h}{a},\quad M_{T} =\frac {1}{K}\sum _{i=1}^{K} I_{T}\left ({i}\right) \\ m_{T}&=\frac {1}{N} \sum _{i=1}^{N} I_{T}\left ({i}\right) \\ \text {RIL}_{T}&=M_{T} -m_{T},\quad M_{B} =\frac {1}{K}\sum _{i=1}^{K} I_{B}\left ({i}\right) \\ m_{B}&=\frac {1}{M} \sum _{i=1}^{M} I_{B}\left ({i}\right) \\ \text {RIL}_{B}&=M_{B} -m_{B},\quad W=\frac {\text {RIL}_{T}^{2}}{\text {RIL}_{B} +\varepsilon }\tag{9}\end{align*}
where $T$ denotes the target RoI-OI area and $B$ denotes the background area. $k$ is a hyperparameter and is usually set to 5–20. $a$ is another hyperparameter and is usually set to the average area of Context RoI-OIs. $w$ and $h$ are the width and height of $B$, respectively. $M_T$ is the average value of the top $K$ largest pixels in $T$, and $m_T$ is the average value of all pixels in $T$ (likewise for $M_B$ and $m_B$). $\varepsilon$ is a very small number close to 0, and the final output $W$ is the contrast feature.
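A minimal sketch of the RIL-based contrast feature $W$ in (9) follows; $k = 5$ follows the text, while the value of $a$ (the average Context RoI-OI area) and the inner-box bookkeeping are illustrative assumptions.

```python
import numpy as np

def ril_contrast(basic: np.ndarray, context: np.ndarray, inner_box,
                 k: float = 5.0, a: float = 900.0, eps: float = 1e-6) -> float:
    """basic: Basic RoI-OI (the target area T); context: Context RoI-OI;
    inner_box = (y1, x1, y2, x2) locates T inside context, so the
    remaining ring is the background B."""
    h, w = context.shape
    K = max(int(k * w * h / a), 1)

    y1, x1, y2, x2 = inner_box
    ring_mask = np.ones_like(context, dtype=bool)
    ring_mask[y1:y2, x1:x2] = False            # cut off T; keep the ring B
    b_pixels = context.astype(float)[ring_mask]
    t_pixels = basic.astype(float).ravel()

    def top_k_mean(v: np.ndarray) -> float:    # M: mean of the K largest pixels
        return float(np.sort(v)[::-1][:min(K, v.size)].mean())

    ril_t = top_k_mean(t_pixels) - t_pixels.mean()   # RIL_T = M_T - m_T
    ril_b = top_k_mean(b_pixels) - b_pixels.mean()   # RIL_B = M_B - m_B
    return ril_t ** 2 / (ril_b + eps)                # W in (9)
```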
Region division strategy for computing HOG features and definition of the surrounding background area $B$.
Comparison of contrast features of ships and false alarms: (a) MPCM and (b) proposed contrast feature. Green boxes are ships, and red boxes are false alarms. Contrast features greater than threshold are regarded as ships and marked in green, while features less than threshold are considered nonships and marked in red. Thresholds for MPCM and the proposed contrast feature are set to 6.0 and 200.0, respectively.
Fig. 15(b) displays some calculation results of the proposed contrast feature with
To sum up, we get 3-D geometric features,
D. Loss Function
A multitask learning strategy is adopted to realize end-to-end training. Compared with common two-stage detection networks, there are two more supervised learning tasks in KCPNet: the mask generation task in CA-Net and the visual feature regression task in the knowledge-driven prediction head. With the two new tasks, the loss function of KCPNet is
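The equation itself did not survive extraction here; a plausible form consistent with this description, with the standard two-stage terms plus the two new weighted tasks and assumed weights $\lambda_1$ and $\lambda_2$, is
\begin{equation*} L = L_{\text {rpn}} + L_{\text {cls}} + L_{\text {reg}} + \lambda _{1} L_{\text {mask}} + \lambda _{2} L_{\text {vf}}\tag{10}\end{equation*}
where $L_{\text{mask}}$ supervises the mask generation in CA-Net and $L_{\text{vf}}$ is the smooth-L1 regression loss on visual features mentioned in the Discussion.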

Experiment
In this section, we first introduce the experimental environment and hyperparameter settings in detail. Then, we conduct groups of ablation experiments to explore the effectiveness and superiority of each network component. Finally, we compare the performance of KCPNet with other state-of-the-art networks on ISDD. The proposed KCPNet achieves the best performance on multiple evaluation metrics.
A. Experimental Settings
1) Experimental Environment:
To be fair, all experiments in this section are conducted on the same server, which has the Ubuntu 18.04 operating system, an Intel Core i9-10940X CPU at 3.30 GHz, and a TITAN RTX GPU with 24-GB memory. We implement the proposed KCPNet on the TensorFlow framework.
2) Training Strategy and Network Settings:
a) Data augmentation:
Given the lack of public datasets for ship detection in infrared remote sensing images, we adopt our ISDD as the main dataset for experiments. ISDD contains 1284 images. To reduce the risk of overfitting under few-shot learning, we apply online data augmentation strategies, including horizontal flip, vertical flip, and random rotation of 90°, 180°, and 270°, as sketched below.
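A minimal sketch of these augmentations, applied jointly to the image and its boxes; the (x1, y1, x2, y2) box format is an assumption.

```python
import random
import numpy as np

def rot90_ccw(image: np.ndarray, boxes: np.ndarray):
    """One 90-degree counterclockwise rotation of the image and its boxes."""
    h, w = image.shape[:2]
    image = np.rot90(image).copy()
    x1, y1, x2, y2 = boxes.T
    boxes = np.stack([y1, w - x2, y2, w - x1], axis=1)
    return image, boxes

def augment(image: np.ndarray, boxes: np.ndarray):
    h, w = image.shape[:2]
    if random.random() < 0.5:                      # horizontal flip
        image = image[:, ::-1].copy()
        boxes = np.stack([w - boxes[:, 2], boxes[:, 1],
                          w - boxes[:, 0], boxes[:, 3]], axis=1)
    if random.random() < 0.5:                      # vertical flip
        image = image[::-1].copy()
        boxes = np.stack([boxes[:, 0], h - boxes[:, 3],
                          boxes[:, 2], h - boxes[:, 1]], axis=1)
    for _ in range(random.choice([0, 1, 2, 3])):   # 0/90/180/270 rotation
        image, boxes = rot90_ccw(image, boxes)
    return image, boxes
```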
b) Training strategy:
We train KCPNet for 60k total training iterations with an initial learning rate of 0.001 and a weight decay of 0.0001. The learning rate is decayed by 10× at 20k iterations and by 100× at 50k iterations. For the weights of different training tasks in loss function equation (10),
c) Network settings:
ResNet101 is selected as the backbone in the experiments. We initialize the backbone parameters with ResNet101 pretrained on ImageNet and freeze the parameters in
3) Evaluation Metrics:
We adopt AP@50, AP@50:95, APS@50, APM@50, and APL@50 from the widely used COCO metrics as our evaluation metrics. Moreover, in practical applications, we can only select a certain threshold to obtain the test results, so we also evaluate Precision, Recall, and F1-Score with a confidence threshold of 0.5 and an IoU threshold of 0.5 (a matching sketch follows). It is worth noting that there are no ships in ISDD that qualify as large targets by COCO standards. Thus, we redefine the size division standard. Ships with area in the range of [0,
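A minimal sketch of this fixed-threshold evaluation follows, with greedy one-to-one matching of detections to ground truth; the matching rule is an assumption, as COCO tooling would normally be used.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """a: (4,) box; b: (M, 4) boxes, all as [x1, y1, x2, y2]."""
    ix1, iy1 = np.maximum(a[0], b[:, 0]), np.maximum(a[1], b[:, 1])
    ix2, iy2 = np.minimum(a[2], b[:, 2]), np.minimum(a[3], b[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def precision_recall_f1(dets, scores, gts, conf_thr=0.5, iou_thr=0.5):
    """dets: (N, 4); scores: (N,); gts: (M, 4)."""
    keep = scores >= conf_thr
    dets = dets[keep][np.argsort(-scores[keep])]   # high-confidence first
    matched = np.zeros(len(gts), dtype=bool)
    tp = 0
    for d in dets:
        if len(gts) == 0:
            break
        ious = iou(d, gts)
        ious[matched] = 0.0                        # each GT matches at most once
        j = int(ious.argmax())
        if ious[j] >= iou_thr:
            matched[j] = True
            tp += 1
    p = tp / max(len(dets), 1)
    r = tp / max(len(gts), 1)
    f1 = 2 * p * r / max(p + r, 1e-9)
    return p, r, f1
```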
B. Ablation Studies
In this section, we discuss the effectiveness and superiority of each component in the proposed KCPNet and explore the influence of some detailed network settings. Faster R-CNN with a ResNet101 backbone is chosen as the baseline for comparison. Hyperparameters are kept consistent across all ablation experiments. The results of the ablation studies are listed in Table IV. Fig. 16 displays examples of multiscene performance comparison.
Performance of four networks in ablation studies in five typical scenes of ISDD. The green, yellow, and red bounding boxes represent correctly detected ships, missed ships, and false alarms: (a) baseline (Net1); (b) baseline + BFF-Net (Net2); (c) baseline + BFF-Net + CA-Net (Net5); and (d) proposed KCPNet (Net8). The comparison between (a) and (b) reflects that BFF-Net can effectively improve the recall of small ships. The results of (c) and (d) demonstrate that CA-Net and the knowledge-driven prediction head further boost the ability to remove false alarms in complex scenes.
1) Effect of BFF-Net:
BFF-Net fuses features from all backbone layers to a proper resolution for small targets with insignificant scale variances. The RFEM in BFF-Net obtains nonlocal information with balanced receptive fields to make up for the weak semantic features of infrared ships. As shown in Table IV, the baseline (Net1) only gets an AP@50 of 84.53%, Recall of 82.59%, and APS of 80.31%. After introducing BFF-Net (Net2), AP@50 increases to 85.65%, and Recall and APS rise by 5.00% and 2.01%, respectively. Comparing Fig. 16(a) and (b), BFF-Net effectively improves the recall of small targets. Unfortunately, the increase in Recall sacrifices Precision, which may be attributed to the increase of false alarms on the larger feature map. Comparing KCPNet with and without BFF-Net (Net7 and Net8), there are significant improvements on all metrics after introducing BFF-Net, especially APS (+3.81%), APM (+2.63%), and AP@50 (+2.59%).
To study the superiority of the internal components of BFF-Net, we replace the proposed RFEM with other popular RFEMs for testing, including ASPP [43], the efficient spatial pyramid (ESP) [46], and C5 [44]. FPN is also implemented to compare the performance of other network necks. The experimental results are listed in Table V. Without RFEM, BFF-Net can still improve APS@50 by 3.16%, but APM@50 and APL@50 decline slightly, which may be caused by the stride decrease of the fused layer. The introduction of RFEM not only makes up for the above shortcomings but also effectively improves performance in all aspects. The proposed RFEM clearly achieves the best performance among the popular RFEMs. Thus, BFF-Net is recommended to be used together with RFEM. Compared with FPN, BFF-Net performs better on all metrics except APL, reflecting that BFF-Net is more suitable for small targets with slight scale variance.
2) Effect of CA-Net:
Considering the key role of contextual information for small targets, CA-Net is proposed to highlight information of targets and their surrounding areas and suppress clutter in complex scenes. As shown in Table IV, CA-Net boosts AP@50 from 84.53% (Net1) to 87.74% (Net3) and improves average precision (AP) at all scales significantly. As shown in Fig. 16(b) and (c), CA-Net significantly reduces land false alarms in complex scenes. Compared with Net6, Net8 with CA-Net achieves a 7.84% increase in Precision, demonstrating that CA-Net effectively suppresses severe background clutter and further reduces false alarms in complex scenes. The cooperation of CA-Net and BFF-Net (Net5) raises AP@50 by 4.54% (from 84.53% to 89.07%) and AP@50:95 by 2.47% (from 39.65% to 42.12%).
Moreover, we conduct a group of comparison experiments with different hyperparameters to figure out the individual influence of CA-Net components such as Mask1, Mask2, and the shortcut. The results are shown in Table VI. It can be observed that there is a sharp performance decrease (−2.27% in AP@50) for CA-Net without the shortcut. The main reason is that the shortcut prevents feature degradation in background regions. CA-Net with single
With the hyperparameter setting above, some visualization results of CA-Net are displayed in Fig. 17. It can be observed that the mask
Visualization results of CA-Net: (a) original input image; (b) input feature map
Furthermore, we are also curious about what happens if we add an
3) Effect of Knowledge-Driven:
Inspired by human cognitive processes, the knowledge-driven prediction head introduces visual features as prior knowledge, breaking the limitation of the black-box properties of CNNs to a certain extent. As shown in Table IV, the knowledge-driven prediction head improves AP@50 from 84.53% (Net1) to 86.86% (Net4). In addition, the Precision of Net5 rises by 4.28% after introducing the knowledge-driven prediction head. Fig. 16(d) demonstrates that the knowledge-driven prediction head improves the ability of the network to identify false alarms. It is worth noting that networks with only one or two subnetworks fail to achieve an APS higher than 85%, but the proposed KCPNet with all three components boosts APS to 87.31%, indicating that the interaction of the three subnetworks can maximize the detection performance on small targets.
Reasonable design of the visual features is the key to the knowledge-driven detection head. We evaluate network performance under 11 different combinations of visual features and three normalization strategies. In Table VIII, G1–G11 display the influence of visual feature design. The comparison between G1 and G3 reflects that increasing the variety of features does not lead to higher performance. On the contrary, appropriate feature selection is more important for optimal performance. G4–G6 test three combinations of texture features. GLCM gets higher APS and AP@50, while GLCM + HOG achieves higher AP@50:95. To better interact with the HOG features from Contextual RoIs, we choose GLCM + HOG as the texture features in KCPNet. G7 and G8 show that dividing the Contextual RoI into four subblocks to compute additional HOG features can further improve performance. G9–G11 verify the effectiveness of the proposed contrast feature.
TABLE VIII Comparison of Different Visual Feature Designs for the Knowledge-Driven Prediction Head
G11–G13 in Table VIII explore the effect of whether and where to use normalization. Without normalization, G12 only achieves an AP@50 of 89.96% (−1.03%) and an APS of 85.52% (−1.79%), demonstrating the key role of normalization. G13 conducts normalization before prediction, which means using normalized features as training labels. The performance of G13 is slightly inferior to that of G11, which indicates that the location of normalization also has a certain impact on the network.
C. Comparison With the State-of-the-Art
To impartially evaluate the performance of KCPNet, we implement nine state-of-the-art detection networks on ISDD for comparison, including Faster RCNN [13], SSD [15], CenterNet [16], RetinaNet [25], YOLO [56], FCOS [57], EfficientDet [58], Cascade RCNN [59], and Sparse RCNN [60]. For all experiments, hyperparameters such as the NMS threshold and the RPN parameters in two-stage networks are kept consistent with KCPNet. To be fair, except for YOLO, EfficientDet, and CenterNet, which use their own backbones, we implement all the above networks with ResNet101. We adjust training settings such as anchor settings and training epochs for several networks to exert their best performance on the ship detection task. We implement RetinaNet, YOLO, CenterNet, and Sparse RCNN based on PyTorch because the accuracy of their TensorFlow versions is not satisfactory. It should be clarified that different deep learning frameworks (TensorFlow/PyTorch) may affect the speed and memory comparison to some extent. Table IX
P–R curves of state-of-the-art detection networks on ISDD with IoU threshold of 0.5.
D. Discussion
To explore how the introduced knowledge works in the network, we carefully study the learning process of the visual features. We find that the knowledge-driven prediction head does not learn every visual feature precisely. In addition, as shown in Fig. 19, the regression loss of the visual features oscillates somewhat, which may affect the learning of other tasks. The main reason is that the smooth-L1 loss may not meet the requirement of high-precision regression of high-dimensional features. Thus, in our future work, we will further design loss functions that are more suitable for the visual feature regression task. Another possible solution is to quantize the visual features into discrete levels, turning the regression task into a classification task, which may converge more stably in training.
What is more, the proposed KCPNet passes the knowledge of visual features to each layer of the entire network through back-propagation. However, the knowledge is not as effective in low-layer features as in high-layer features. Therefore, we consider truncating the back-propagation process of visual feature regression at suitable layers to optimize the learning process of visual features and improve the efficiency of back-propagation.
Conclusion
In this article, we proposed KCPNet for infrared ship detection under complex ocean environments. A BFF-Net was designed to improve the recall and precision for ships at all scales, especially small ships, by balancing semantic and location information and generating nonlocal features with balanced receptive fields. Then, to cope with severe clutter in complex scenes and the thin semantic information of infrared ships, we proposed a pixel-level attention network, CA-Net, to highlight the targets and their contextual information and simultaneously suppress background clutter. Furthermore, we modeled handcrafted visual features into the prediction head to propagate prior knowledge throughout the whole network, which diminishes false alarms and further boosts detection performance. In addition, as far as we know, ISDD published in this article is the first public benchmark for infrared ship detection. Extensive experiments demonstrate that our KCPNet achieves state-of-the-art performance on ISDD. In future work, we will further explore more efficient models to disseminate knowledge in data-driven approaches. In addition, ISDD will keep updating with more images and scenes that are closer to practical applications.
ACKNOWLEDGMENT
The authors would like to thank the editor and the anonymous reviewers who gave constructive comments and helped to improve the quality of this article. The same appreciation goes to the authors of the published code of Yang's models, Zhou's model, and Luo's model used for comparison.