(accv) Package accv Warning: Package ‘hyperref’ is loaded with option ‘pagebackref’, which is *not* recommended for camera-ready version
（ACCV）软件包 accv 警告：软件包 'hyperref' 加载了选项 'pagebackref'，这 *不* 推荐用于相机就绪版本

¹¹institutetext: Department of Computer Science, Yonsei University, Seoul, Republic of Korea
¹研究所文本：韩国首尔延世大学计算机科学系
¹¹email: {skd, jyy1551, rlwjd4177, sanghyun}@yonsei.ac.kr

VideoPatchCore: An Effective Method to Memorize Normality for Video Anomaly Detection
VideoPatchCore：一种用于视频异常检测的正常性的有效方法

Sunghyun Ahn\orcidlink0000-0001-9881-5731 11 Youngwan Jo\orcidlink0000-0003-2773-7670 11 Kijung Lee\orcidlink0000-1111-2222-3333 11 Sanghyun Park\orcidlink0000-0002-5196-6193 Corresponding author: Sanghyun Park ()
17th Asian Conference on Computer Vision (ACCV 2024)11 sanghyun@yonsei.ac.kr

Abstract 抽象

Video anomaly detection (VAD) is a crucial task in video analysis and surveillance within computer vision. Currently, VAD is gaining attention with memory techniques that store the features of normal frames. The stored features are utilized for frame reconstruction, identifying an abnormality when a significant difference exists between the reconstructed and input frames. However, this approach faces several challenges due to the simultaneous optimization required for both the memory and encoder-decoder model. These challenges include increased optimization difficulty, complexity of implementation, and performance variability depending on the memory size. To address these challenges, we propose an effective memory method for VAD, called VideoPatchCore. Inspired by PatchCore, our approach introduces a structure that prioritizes memory optimization and configures three types of memory tailored to the characteristics of video data. This method effectively addresses the limitations of existing memory-based methods, achieving good performance comparable to state-of-the-art methods. Furthermore, our method requires no training and is straightforward to implement, making VAD tasks more accessible. Our code is available online at github.com/SkiddieAhn/Paper-VideoPatchCore.
视频异常检测（VAD）是计算机视觉中视频分析和监控的一项关键任务。目前，VAD 通过存储正常帧特征的内存技术而受到关注。存储的特征用于帧重建，当重建帧和输入帧之间存在显著差异时识别异常。然而，由于内存和编码器-解码器模型需要同时进行优化，因此这种方法面临一些挑战。这些挑战包括优化难度增加、实现复杂性以及性能变化（具体取决于内存大小）。为了应对这些挑战，我们提出了一种有效的 VAD 内存方法，称为 VideoPatchCore。受 PatchCore 的启发，我们的方法引入了一种结构，该结构优先考虑内存优化，并根据视频数据的特性配置了三种类型的内存。这种方法有效地解决了现有基于内存的方法的局限性，实现了与最先进的方法相当的良好性能。此外，我们的方法不需要培训并且易于实施，使 VAD 任务更易于访问。我们的代码可在 github.com/SkiddieAhn/Paper-VideoPatchCore 在线获取。

Keywords:

Video Anomaly Detection Machine learning Computer vision

关键字：

视频异常检测机器学习计算机视觉

Refer to caption — (a) Memory-augmented Method
（a）记忆增强法

1 Introduction 1 介绍

The task of video anomaly detection (VAD) involves identifying anomalies within video sequences. Typically, deep learning-based VAD methods employ one-class classification (OCC) techniques as anomalies are rare and it is challenging to define them precisely. OCC methods train models exclusively with normal data. In VAD, prominent OCC approaches include reconstruction and prediction-based methods [10, 25, 38, 18, 36]. These approaches utilize encoder-decoder models such as autoencoders (AEs), generative adversarial networks (GANs) [9], and others. During training, these models train to reconstruct or predict frame using only normal data, operating under the assumption that anomalies will result in high reconstruction or prediction errors during testing. However, these methods face challenges due to the exceptional generalization ability of deep learning models, which can generate plausible outputs for anomalous data [38, 13, 14, 12].
视频异常检测（VAD）的任务涉及识别视频序列中的异常。通常，基于深度学习的 VAD 方法采用单类分类（OCC）技术，因为异常很少见，并且很难准确定义它们。OCC 方法仅使用正常数据训练模型。在 VAD 中，突出的 OCC 方法包括重建和基于预测的方法 [10， 25， 38， 18， 36]。这些方法利用编码器-解码器模型，例如自动编码器（AE）、生成对抗网络（GAN） [9] 等。在训练期间，这些模型训练仅使用正常数据重建或预测帧，假设异常将导致测试期间的高重建或预测误差。然而，由于深度学习模型的特殊泛化能力，这些方法面临着挑战，它可以为异常数据生成合理的输出 [38， 13， 14， 12]。

Recently, methods that store the features of normal data in memory during training have shown improved performance in reducing the generalization ability for anomalies. Memory-augmented VAD approaches [8, 26, 3, 23, 20, 37, 31, 19, 32] store the features of normal frames in memory and retrieve similar features from the memory to generate the input or future frame during testing. In such cases, anomalies result in higher reconstruction or prediction errors because their similar features are not stored in the memory. Fig. 1(a) illustrates these memory utilization approaches. While memory-augmented methods demonstrate good performance, they also present several challenges. First, the simultaneous optimization of the model and memory makes the optimization process difficult and the implementation complex. Additionally, the size of the memory significantly impacts performance [41, 2]. Although many recent studies [26, 32] have focused on reducing memory size, finding an appropriate memory size for large video datasets remains a challenge.
最近，在训练期间将正常数据的特征存储在内存中的方法在降低异常的泛化能力方面显示出更好的性能。内存增强 VAD 方法 [8， 26， 3， 23， 20， 37， 31， 19， 32] 将普通帧的特征存储在内存中，并从内存中检索类似的特征，以在测试期间生成输入帧或未来帧。在这种情况下，异常会导致更高的重建或预测误差，因为它们的相似特征未存储在内存中。无花果。 1（a）说明了这些内存利用率方法。虽然内存增强方法表现出良好的性能，但它们也带来了一些挑战。首先，模型和内存的同步优化使优化过程变得困难，实现复杂。此外，内存的大小会显著影响性能 [41， 2]。尽管最近的许多研究 [26， 32] 都集中在减小内存大小上，但为大型视频数据集找到合适的内存大小仍然是一个挑战。

Inspired by PatchCore[30], we propose VideoPatchCore (VPC) to address the limitations of the previously mentioned memory utilization methods. As shown in Fig. 1(b), VPC leverages the vision encoder of the pre-trained CLIP [27] without additional training and optimizes memory using greedy coreset subsampling [1]. This enables effective VAD with a small memory footprint even for large datasets. Unlike PatchCore, which is used at the image level, VPC performs anomaly detection at the video level by using two streams: local and global. The local stream detects anomalies in individual objects, while the global stream detects anomalies in entire frames. The local stream employs a spatial memory bank to identify the appearance anomalies and a temporal memory bank to detect action anomalies. The global stream utilizes a high-level semantic memory bank to detect anomalies in interactions among multiple objects and abnormalities that can only be identified by analyzing the scene (e.g., wrong direction). By utilizing spatial and temporal memory banks, VPC considers the spatiotemporal characteristics of videos, and with the two-stream approach, it can detect various forms of anomalies.
受 PatchCore[30] 的启发，我们提出了 VideoPatchCore （VPC）来解决前面提到的内存利用方法的局限性。如图 1 所示。 1（b）中，VPC 利用预训练的 CLIP [27] 的视觉编码器，无需额外训练，并使用贪婪核心集子采样 [1] 优化内存。即使对于大型数据集，这也能以较小的内存占用实现有效的 VAD。与在图像级别使用的 PatchCore 不同，VPC 使用两个流（本地和全局）在视频级别执行异常检测。本地流检测单个对象中的异常，而全局流检测整个帧中的异常。本地流使用空间内存库来识别外观异常，并使用时间内存库来检测操作异常。全局流利用高级语义内存库来检测多个对象之间交互的异常以及只能通过分析场景来识别的异常（例如，错误的方向）。通过利用时空记忆库，VPC 考虑了视频的时空特征，并通过双流方法检测各种形式的异常。

Through experiments, we demonstrate the effectiveness of using three memory banks in the video domain. Additionally, we introduce two methods for effectively memorizing video characteristics and demonstrate the robustness of performance concerning memory size on large datasets. Our proposed VPC shows good performance comparable to state-of-the-art techniques. We anticipate that VPC will enhance the accessibility of the VAD field and facilitate its use in real-world applications.
通过实验，我们证明了在视频域中使用三个内存库的有效性。此外，我们介绍了两种有效记忆视频特征的方法，并展示了在大型数据集上内存大小方面的性能稳健性。我们提出的 VPC 显示出与最先进的技术相当的良好性能。我们预计 VPC 将增强 VAD 字段的可访问性，并促进其在实际应用中的使用。

Our contributions are as follows:
我们的贡献如下：

•

We propose VPC, an extension of PatchCore developed for image anomaly detection, to perform effective video anomaly detection.

• 我们提出了 VPC，这是为图像异常检测而开发的 PatchCore 的扩展，用于执行有效的视频异常检测。
•

VPC employs two streams (local and global) and three memory banks (spatial, temporal, and high-level semantic) to capture the spatiotemporal characteristics of videos and detect various forms of anomalies.

• VPC 采用两个流（本地和全局）和三个内存库（空间、时间和高级语义）来捕获视频的时空特征并检测各种形式的异常。
•

VPC achieves good performance comparable to state-of-the-art methods.

• VPC 实现了与最先进的方法相媲美的良好性能。

2 Related Work 阿拉伯数字相关工作

2.1 Reconstruction & prediction-based Video Anomaly Detection
2.1 基于重建和预测的视频异常检测

The reconstruction-based approach is based on the assumption that a model trained to reconstruct normal video frames fails to accurately reconstruct abnormal frames. Nguyen [25] proposed a model comprising one encoder and two decoders. This model considered the $t^{th}$ frame as the input and predicted both the reconstructed frame and the optical flow between the $t^{th}$ and $t+1^{th}$ frames. Zaheer [38] used GANs to generate high-quality reconstructed frames. In this approach, the discriminator was trained to classify high-quality frames as normal and low-quality frames as abnormal, utilizing a generator in its previous state that produced low-quality frames.
基于重建的方法基于这样一个假设，即训练用于重建正常视频帧的模型无法准确重建异常帧。Nguyen [25] 提出了一个由一个编码器和两个解码器组成的模型。该模型将 $t^{th}$ 帧视为输入，并预测重建的帧和 $t^{th}$ 和 $t+1^{th}$ 帧之间的光流。Zaheer [38] 使用 GAN 生成高质量的重建帧。在这种方法中，判别器被训练为将高质量帧分类为正常帧，将低质量帧分类为异常帧，利用生成低质量帧的先前状态的生成器。

The prediction-based approach is based on the assumption that when a model is trained to predict future or past frames of normal video sequences, it fails to accurately predict abnormal frames. Liu [18] utilized FlowNet and GANs to predict the $t+1^{th}$ frame from $t$ frames. Yang [36] suggested selecting key frames from $t$ frames and predicting all $t$ frames based only on these key frames.
基于预测的方法基于以下假设：当模型经过训练以预测正常视频序列的未来或过去帧时，它无法准确预测异常帧。Liu [18] 利用 FlowNet 和 GAN 从帧中 $t$ 预测 $t+1^{th}$ 帧。Yang [36] 建议从 $t$ 帧中选择关键帧，并仅根据这些关键帧预测所有 $t$ 帧。

2.2 Video Anomaly Detection using Memory
2.2 元使用内存进行视频异常检测

In VAD, memory is utilized to address the issue of accurately generating abnormal frames in OCC method. Gong [8] proposes a method that stores the features of normal frames in memory and generates normal frames using these stored memory items. Park [26] suggests a method for learning various patterns of normal data by clustering similar features and using the cluster centers as memory items, thus storing numerous normal features in a compact memory. Sun [32] employs contrastive learning to cluster features based on their classes and stores each feature in memory. This technique allows for the creation of a compact memory by leveraging the semantic information of objects.
在 VAD 中，利用内存来解决 OCC 方法中准确生成异常帧的问题。Gong [8] 提出了一种方法，将普通帧的特征存储在内存中，并使用这些存储的内存项生成普通帧。Park [26] 提出了一种学习法线数据各种模式的方法，该方法通过将相似特征聚类并使用簇中心作为内存项，从而在紧凑的内存中存储大量法线特征。Sun [32] 采用对比学习根据特征的类对特征进行聚类，并将每个特征存储在内存中。该技术允许通过利用对象的语义信息来创建紧凑的内存。

2.3 Representation-based Image Anomaly Detection
2.3 基于表示的图像异常检测

In the image domain, representation-based approaches have gained attention as effective methods for anomaly detection. This approach utilizes pre-trained networks on large-scale datasets such as ImageNet[7] and compares the embeddings of normal and abnormal images to detect anomalies. Cohen[6] proposes feature pyramid matching, leveraging the different information at each layer of the network. Features from the training data are stored in memory, and anomaly scores for the features of test data are computed using k-NN with respect to the features stored in memory. Roth[30] points out that high-level features of pre-trained networks are specialized for classification purposes, indicating the need to extract features at the middle level. These features are stored in memory, and memory optimization is achieved through Coreset Subsampling.
在图像领域，基于表示的方法作为异常检测的有效方法而受到关注。这种方法利用大规模数据集（如 ImageNet）上的预训练网络[7]，并比较正常和异常图像的嵌入以检测异常。Cohen[6] 提出了特征金字塔匹配，利用网络每一层的不同信息。训练数据中的特征存储在内存中，测试数据特征的异常分数使用 k-NN 相对于内存中存储的特征计算。Roth[30] 指出，预训练网络的高级特征专门用于分类目的，这表明需要在中间阶段提取特征。这些功能存储在内存中，内存优化是通过 Coreset Subssampling 实现的。

3 Method 3 方法

3.1 Overview 3.1 概述

Fig. 2 depicts the proposed VPC model, consisting of two streams: the local stream at the object level and the global stream at the frame level. Each stream extracts features through a vision encoder from CLIP, pre-trained on large-scale datasets, and these features are segmented according to the type of memory and accessed in the memory bank. During this process, the memorization and inference stages are performed. In the memorization stage, features are stored in memory, which is then optimized using greedy coreset subsampling from PatchCore. In the inference stage, anomaly scores are derived by calculating distances between memory and features. The following provides a detailed description of each component of the proposed VPC.
无花果。图 2 描述了建议的 VPC 模型，该模型由两个流组成：对象级别的本地流和帧级别的全局流。每个流都通过 CLIP 的视觉编码器提取特征，在大规模数据集上进行预训练，这些特征根据内存类型进行分割，并在内存库中访问。在此过程中，将执行记忆和推理阶段。在记忆阶段，特征存储在内存中，然后使用 PatchCore 的贪婪核心集子采样进行优化。在推理阶段，异常分数是通过计算内存和特征之间的距离来得出的。下面提供了建议的 VPC 的每个组件的详细说明。

Vision Encoder. We adopt a CNN-based CLIP model as the Encoder and utilize layers 2, 3 $\varphi_{2}$ , $\varphi_{3}$ similar to PatchCore. Typically, the shallow layers of the network capture the local features, while deeper layers capture the global features. However, the first layer captures overly detailed features such as edges and corners, and the last layer captures features biased towards classification, which are not helpful for anomaly detection. Thus, we process and utilize features from the intermediate layers $\varphi_{2}$ , $\varphi_{3}$ .
Vision Encoder （视觉编码器）。我们采用基于 CNN 的 CLIP 模型作为编码器，并使用第 2、3 层 $\varphi_{2}$ ， $\varphi_{3}$ 类似于 PatchCore。通常，网络的浅层捕获局部特征，而较深层捕获全局特征。但是，第一层捕获过于详细的特征（如边缘和角落），最后一层捕获偏向于分类的特征，这对异常检测没有帮助。因此，我们处理和利用中间层 $\varphi_{2}$ 的特征。 $\varphi_{3}$

Local Stream. The local stream is composed of YOLOv5[15] for object detection, a vision encoder $\varphi$ for feature extraction, and spatial memory (SM) and temporal memory (TM) for storing the object appearance and motion information. Given a sequence of $d$ consecutive frames $F_{1},F_{2},\ldots,F_{d}\in\mathbb{R}^{3\times H\times W}$ , YOLOv5 generates a set of $n$ objects for each frame, $O_{1},O_{2},\ldots,O_{d}\in\mathbb{R}^{n\times 3\times H\times W}$ . Subsequently, the locally-aware features $l_{1},l_{2},\ldots,l_{d}\in\mathbb{R}^{n\times c\times h\times w}$ are produced that represent the features of the objects.
本地流。本地流由用于对象检测的 YOLOv5[15]、 $\varphi$ 用于特征提取的视觉编码器以及用于存储对象外观和运动信息的空间内存（SM）和时间内存（TM）组成。给定一个连续帧 $F_{1},F_{2},\ldots,F_{d}\in\mathbb{R}^{3\times H\times W}$ 序列 $d$ ，YOLOv5 会为每个帧生成一组 $n$ 对象。 $O_{1},O_{2},\ldots,O_{d}\in\mathbb{R}^{n\times 3\times H\times W}$ 随后，将生成表示对象特征的本地感知特征 $l_{1},l_{2},\ldots,l_{d}\in\mathbb{R}^{n\times c\times h\times w}$ 。

Specifically, average pooling is performed on the features from $\varphi_{2},\varphi_{3}$ and they are concatenated based on $\varphi_{2}$ . This strategy aims to integrate both fine-grained and coarse-grained information that is necessary for object-level VAD. However, as $\varphi_{2},\varphi_{3}$ have relatively small receptive fields, pooling is used to consolidate information over a wider range. The equation for constructing $l_{i}$ is as follows: \linenomathAMS
具体来说，对中的 $\varphi_{2},\varphi_{3}$ 特征执行平均池化，并根据进行连接 $\varphi_{2}$ 。此策略旨在集成对象级 VAD 所需的细粒度和粗粒度信息。然而，与相对较小的感受野一样 $\varphi_{2},\varphi_{3}$ ，池化用于在更宽的范围内整合信息。构造 $l_{i}$ 方程如下：\linenomathAMS

\displaystyle l_{i}=\langle f_{\text{ap}}(\varphi_{2}(O_{i})),f_{\text{ap}}(% \varphi_{3}(O_{i}))\rangle

(1)

where $f_{\text{ap}}$ represents the average pooling and $\langle\cdot,\cdot\rangle$ denotes tensor concatenation.
其中 $f_{\text{ap}}$ 表示平均池化， $\langle\cdot,\cdot\rangle$ 表示张量串联。

Subsequently, $l_{1:d}$ are reshaped into $lf\in\mathbb{R}^{n\times c\times d\times h\times w}$ . However, the dimensionality of $lf$ is impractically large for video processing. Therefore, we employ a method called split pooling to reduce the number of channels, transforming it into $\mathbb{R}^{n\times c^{\prime}\times d\times h\times w}$ ( $c\gg c^{\prime}$ ). The details of this approach can be found in Section 4.6. Subsequently, $lf$ is split into spatial patches and temporal patches through spatial partition and temporal partition, respectively, and stored in SM and TM.
随后， $l_{1:d}$ 被重塑为 $lf\in\mathbb{R}^{n\times c\times d\times h\times w}$ 。但是，对于视频处理来说，维度 $lf$ 不切实际地大。因此，我们采用一种称为 split pooling 的方法来减少通道的数量，将其转换为 $\mathbb{R}^{n\times c^{\prime}\times d\times h\times w}$ （） $c\gg c^{\prime}$ 。此方法的详细信息可以在 Section 4.6 中找到。随后， $lf$ 分别通过空间分区和时间分区拆分为空间补丁和时间补丁，并存储在 SM 和 TM 中。

Global Stream. The global stream constructs globally-aware features $g_{1},g_{2},\\ \ldots,g_{d}\in\mathbb{R}^{c\times 1\times 1}$ to capture the overall context of frames, thereby differing from the local stream, which focuses on individual objects within frames. We utilize $\varphi_{2},\varphi_{3}$ , applying global pooling to obtain global information. Specifically, average pooling is applied to the features of each layer, and these are concatenated based on $\varphi_{3}$ , which handles more global information. During global pooling, both average and max pooling techniques are employed to achieve a better representation. The equation for constructing $g_{i}$ is as follows: \linenomathAMS
全局流。全局流构造全局感知特征 $g_{1},g_{2},\\ \ldots,g_{d}\in\mathbb{R}^{c\times 1\times 1}$ 来捕获帧的整体上下文，因此与本地流不同，本地流侧重于帧内的单个对象。我们利用 $\varphi_{2},\varphi_{3}$ ，应用全局池来获取全局信息。具体来说，平均池化应用于每个图层的特征，这些特征基于 $\varphi_{3}$ 连接，从而处理更多的全局信息。在全局池化期间，同时使用平均池化和最大池化技术来实现更好的表示。构造 $g_{i}$ 方程如下：\linenomathAMS

	$\displaystyle g_{i}=\langle f_{\text{ap}}(\varphi_{2}(F_{i})),f_{\text{ap}}(% \varphi_{3}(F_{i}))\rangle$		(2)
	$\displaystyle g_{i}=f_{\text{gap}}(g_{i})+f_{\text{gmp}}(g_{i})$		(3)

where $f_{\text{gap}}$ and $f_{\text{gmp}}$ denote global average pooling and global max pooling, respectively.
其中 $f_{\text{gap}}$ 和 $f_{\text{gmp}}$ 分别表示 Global Average pooling 和 Global Max pooling。

Subsequently, $g_{1:d}$ are reshaped into $gf\in\mathbb{R}^{c\times d\times 1\times 1}$ . The $gf$ constructed in this manner includes the global information of each frame and is converted into high-level patches through high-level partition. Afterwards, the patch is stored in the High-level Semantic Memory (HSM).
随后， $g_{1:d}$ 被重塑为 $gf\in\mathbb{R}^{c\times d\times 1\times 1}$ 。以这种方式构建的 S $gf$ 包含了每帧的全局信息，并通过 high-level partitions 转换为 high-level patch。之后，补丁将存储在高级语义内存（HSM）中。

3.2 Patch Partition
3.2 补丁分区

Fig. 3 illustrates the proposed patch partition method, which consists of spatial partition, temporal partition, and high-level partition. The spatial partition focuses on object appearance information, generating patches while disregarding temporal information. The temporal partition emphasizes object motion information, generating patches while ignoring spatial information. The high-level partition utilizes extensive spatiotemporal information to generate patches for understanding the contextual information across frames. The detailed descriptions of each partition are as follows.
无花果。图 3 说明了所提出的 patch 分区方法，它由空间分区、时间分区和高级分区组成。空间分区侧重于对象外观信息，在生成补丁的同时忽略时态信息。时间分区强调对象运动信息，生成面片，同时忽略空间信息。高级分区利用广泛的时空信息来生成补丁，以便了解跨帧的上下文信息。每个分区的详细说明如下。

Spatial Partition. Appearance information is crucial for evaluating anomalies in objects and can be derived from spatial features. We compress the input feature map $lf$ using temporal global pooling[16] to preserve spatial information while ignoring temporal aspects. Subsequently, average pooling is applied to consider various regions of the object. For instance, when assessing a person’s pose, it is essential to consider multiple joints together. The equation for creating $SpatialPatches$ is as follows: \linenomathAMS
空间分区。外观信息对于评估对象中的异常至关重要，并且可以从空间特征中获取。我们使用时间全局池化[16] 压缩输入特征图 $lf$ ，以保留空间信息，同时忽略时间方面。随后，应用平均池化来考虑对象的各个区域。例如，在评估一个人的姿势时，必须同时考虑多个关节。创建 $SpatialPatches$ 方程如下：\linenomathAMS

\displaystyle SpatialPatches=f_{\text{ap}}(f_{\text{tgp}}(lf))

(4)

where $f_{\text{tgp}}$ denotes the temporal global pooling. Finally, the results are reshaped into $\mathbb{R}^{(n\cdot\hat{h}\cdot\hat{w})\times c^{\prime}}$ , where $h\geq\hat{h}$ and $w\geq\hat{w}$ .
其中 $f_{\text{tgp}}$ 表示临时全局池化。最后，结果被重塑为 $\mathbb{R}^{(n\cdot\hat{h}\cdot\hat{w})\times c^{\prime}}$ ，其中 $h\geq\hat{h}$ 和 $w\geq\hat{w}$ 。

Temporal Partition. Motion information in objects represents changes over time, making it crucial for VAD. We utilize the adjacent temporal information within $lf$ to generate motion features $mf$ . To achieve this, the feature differences are computed and the representative motion values are determined through global average pooling. The equation for creating $TemporalPatches$ is as follows: \linenomathAMS
Temporal Partition 的 Temporal Partition。对象中的运动信息表示随时间的变化，因此对于 VAD 至关重要。我们利用其中 $lf$ 的相邻时间信息来生成运动特征 $mf$ 。为了实现这一点，计算了特征差异，并通过全局平均池化确定了代表性运动值。创建 $TemporalPatches$ 方程如下：\linenomathAMS

	$\displaystyle mf_{(t)}=\|lf_{(t+1)}-lf_{(t)}\|$		(5)
	$\displaystyle TemporalPatches=f_{\text{gap}}(mf)$		(6)

$mf_{(t)}\in\mathbb{R}^{n\times c^{\prime}\times h\times w}$ represents the difference between the $t^{th}$ and $t+1^{th}$ time within $lf$ , and it belongs to $mf\in\mathbb{R}^{n\times c^{\prime}\times\hat{d}\times h\times w}$ , where $\hat{d}=d-1$ . Finally, the results are reshaped into $\mathbb{R}^{(n\cdot\hat{d})\times c^{\prime}}$ .
$mf_{(t)}\in\mathbb{R}^{n\times c^{\prime}\times h\times w}$ 表示 $t^{th}$ 和 $t+1^{th}$ 时间在 $lf$ 内之间的差异，它属于 $mf\in\mathbb{R}^{n\times c^{\prime}\times\hat{d}\times h\times w}$ ，其中 $\hat{d}=d-1$ 。最后，将结果重塑为 $\mathbb{R}^{(n\cdot\hat{d})\times c^{\prime}}$ 。

High-level Partition. In frames, the global context processes the relationship between the objects and the scene, and considers interactions between different objects, thereby enabling more accurate VAD. As $gf$ already encompasses the global context from a spatial perspective, temporal pyramid pooling is utilized to obtain high-level temporal information. Additionally, this approach secures multi-scale temporal information, addressing the limitation of only using adjacent temporal information in temporal partition methods. The equation for creating $HighlevelPatches$ is as follows: \linenomathAMS
高级分区。在帧中，全局上下文处理对象与场景之间的关系，并考虑不同对象之间的交互，从而实现更准确的 VAD。由于 $gf$ 已经从空间角度包含了全局上下文，因此利用时间金字塔池化来获取高级时间信息。此外，这种方法还保护了多尺度的时态信息，解决了在时态划分方法中仅使用相邻时态信息的限制。创建 $HighlevelPatches$ 方程如下：\linenomathAMS

\displaystyle HighlevelPatches=\sum_{l=0}^{L}f_{\text{mp}}^{l}(gf)

(7)

Temporal pyramid pooling is implemented using $f_{\text{mp}}^{l}$ . In this case, $f_{\text{mp}}^{l}$ represents applying the max pooling operation $l$ times. Finally, the results are reshaped into $\mathbb{R}^{d\times c}$ .
时间金字塔池化是使用 $f_{\text{mp}}^{l}$ 实现的。在本例中， $f_{\text{mp}}^{l}$ 表示应用最大池化操作 $l$ 时间。最后，将结果重塑为 $\mathbb{R}^{d\times c}$ 。

3.3 Inference 3.3 推理

Similar to the memorization stage, the Patch Partition method is applied as described earlier. At this time, mismatched pooling technology is used to improve the VAD performance. This technique and the method of calculating the anomaly score are as follows.
与记忆阶段类似，如前所述应用 Patch Partition 方法。此时，使用不匹配池技术来提高 VAD 性能。此技术和计算异常分数的方法如下。

Mismatched Pooling. During the temporal partition process, global pooling is applied differently in the memorization and inference stages. The memorization stage employs global average pooling (GAP), while global max pooling (GMP) is used during inference. As normal objects typically exhibit more static characteristics compared to abnormal ones, we expected normal motion features to show similar average and maximum values. In contrast, abnormal motion features were anticipated to show significant discrepancies in these values. By adopting this approach, VAD performance improves by preventing the retrieval of nearby temporal patches from memory when abnormal data is input. Experimental validation of the effectiveness of this methodology is detailed in Section 4.6.
不匹配的池。在时间分区过程中，全局池化在记忆和推理阶段的应用方式不同。记忆阶段采用全局平均池化（GAP），而在推理过程中使用全局最大池化（GMP）。由于与异常物体相比，正常物体通常表现出更多的静态特征，因此我们预计正常运动特征会显示相似的平均值和最大值。相比之下，预计异常运动特征会在这些值中表现出显着差异。通过采用这种方法，VAD 性能通过防止在输入异常数据时从内存中检索附近的临时补丁来提高。该方法有效性的实验验证在第 4.6 节中详细说明。

Anomaly Scoring. All nearest neighbors are computed between patches $X$ generated from test frames and the memory bank $M$ . Subsequently, the representative patch $x^{*}$ is selected from the $X$ and its nearest neighbor $m^{*}$ is selected from the $M$ . The distance $s^{*}$ between $x^{*}$ and $m^{*}$ is assigned as the score representing the frames. \linenomathAMS
异常评分。所有最近邻域都是在从 test frames 和 memory bank $M$ 生成的补丁 $X$ 之间计算的。随后，从中选择代表性补丁 $x^{*}$ ，并从中选择其最近的邻居 $m^{*}$ $M$ 。 $X$ 和 $m^{*}$ 之间的 $x^{*}$ 距离 $s^{*}$ 被分配为表示帧的分数。\linenomathAMS

	$\displaystyle x^{},m^{}=\underset{x\in X}{\mathrm{arg\,max}}\,\underset{m\in M% }{\mathrm{arg\,min}}\,\\|x-m\\|_{2}$		(8)
	$\displaystyle s^{}=\\|x^{}-m^{*}\\|_{2}$		(9)

As three types of memory are used, $s^{*}$ is computed three times for SM, TM, and HSM, respectively. The $s_{\text{spatial}}^{*}$ calculated from SM and $s_{\text{temporal}}^{*}$ calculated from TM determine the local anomaly score (LAS) in the local stream. Conversely, $s_{\text{high-level}}^{*}$ computed from HSM is utilized as the global anomaly score (GAS). Ultimately, the LAS and GAS combine to derive the anomaly score. The equation for calculating the anomaly score is as follows: \linenomathAMS
由于使用了三种类型的内存， $s^{*}$ 因此将分别对 SM、TM 和 HSM 计算 3 次。根据 SM $s_{\text{spatial}}^{*}$ 计算和 $s_{\text{temporal}}^{*}$ 从 TM 计算确定本地流中的局部异常评分（LAS）。相反， $s_{\text{high-level}}^{*}$ 从 HSM 计算得出的分数将用作全局异常分数（GAS）。最终，LAS 和 GAS 结合得出异常分数。计算异常分数的公式如下：\linenomathAMS

	$\displaystyle LAS=\delta_{1}\cdot s_{\text{spatial}}^{}+\delta_{2}\cdot s_{% \text{temporal}}^{}$		(10)
	$\displaystyle GAS=s_{\text{high-level}}^{*}$		(11)
	$\displaystyle AnomalyScore=\gamma_{1}\cdot\frac{LAS-\mu(LAS)}{\sigma(LAS)}+% \gamma_{2}\cdot\frac{GAS-\mu(GAS)}{\sigma(GAS)}$		(12)

4 Experiments 4 实验

4.1 Datasets 4.1 数据

The proposed model was validated on three datasets: CUHK Avenue (Avenue)[21], ShanghaiTech (SHTech)[40], and IITB Corridor (Corridor)[29]. As each dataset was constructed for OCC method, the training dataset consisted only of normal videos, while the test dataset contained both normal and abnormal videos.
所提出的模型在三个数据集上进行了验证：香港中文大学大道（Avenue）[21]、上海科技大学（SHTech）[40] 和 IITB 走廊（Corridor）[29]。由于每个数据集都是针对 OCC 方法构建的，因此训练数据集仅包含正常视频，而测试数据集包含正常视频和异常视频。

Avenue. The dataset consisted of 16 training and 21 testing videos filmed with a single camera at the same angle. It included anomaly events such as throwing paper, running on the road, and walking in the wrong direction.
大道。该数据集包括 16 个训练视频和 21 个测试视频，这些视频是用单个相机以相同角度拍摄的。它包括异常事件，例如扔纸、在路上奔跑和走错方向。

SHTech. The dataset consisted of 330 training videos and 107 testing videos, capturing 13 scenes from different angles. It included various anomaly events such as the appearance of anomaly objects (e.g., cars, bicycles) and action anomalies (e.g., fighting, jumping).
SHTech 的该数据集由 330 个训练视频和 107 个测试视频组成，从不同角度捕捉了 13 个场景。它包括各种异常事件，例如异常物体（例如，汽车、自行车）的出现和动作异常（例如，打斗、跳跃）。

Corridor. The dataset consisted of 208 training and 150 testing videos filmed with a single camera at the same angle. It included previously unseen anomaly events such as protesting, hiding, and playing with the ball.
走廊。该数据集包括 208 个训练视频和 150 个测试视频，这些视频是用单个相机以相同角度拍摄的。它包括以前看不见的异常事件，例如抗议、躲藏和玩球。

4.2 Implementation details
4.2 实现细节

Object Detection. The pretrained YOLOv5 was utilized for object detection in the videos. The bounding boxes of the objects were generated based on the last or middle frames among the $d$ input frames provided, and these bounding boxes were shared across other frames. Specifically, for the Avenue, SHTech, and Corridor datasets, we used 10, 4, 4 input frames, respectively. For the Avenue dataset, objects were detected based on the middle frame. Furthermore, to minimize the loss of temporal information, margins were added to the width and height of the bounding boxes, ensuring that they remained equal in size. Finally, the object images were resized to (224, 224) before being input into the pre-trained network.
对象检测。预训练的 YOLOv5 用于视频中的对象检测。对象的边界框是根据提供的 $d$ 输入帧中的最后一个帧或中间帧生成的，并且这些边界框在其他帧之间共享。具体来说，对于 Avenue、SHTech 和 Corridor 数据集，我们分别使用了 10、4、4 个输入帧。对于 Avenue 数据集，根据中间帧检测对象。此外，为了最大限度地减少时间信息的损失，在边界框的宽度和高度上添加了边距，以确保它们的大小保持相等。最后，在输入到预训练网络之前，将对象图像的大小调整为（224， 224）。

Memorizing details. We utilized CLIP’s ResNet-101 as the vision encoder. The value $c^{\prime}$ for split pooling in the local stream was determined by the dataset size. Specifically, it was set to 32 for the smaller Avenue dataset and 64 for the larger SHTech and Corridor datasets. For $lf$ and $gf$ generation, average pooling with a kernel size of $(3,3)$ and stride of 1 was used. In the spatial partition, average pooling was performed with a kernel size of $(4,4)$ . In the high-level partition, $L$ in TPP was set to 2, and during max pooling, a kernel size of $(2)$ with a stride of 1 was used. During memory optimization, coreset subsampling rates were set to 1, 25, and 10% for the Avenue, SHTech, and Corridor datasets, respectively.
记住细节。我们使用 CLIP 的 ResNet-101 作为视觉编码器。本地流中拆分池的值 $c^{\prime}$ 由数据集大小决定。具体来说，对于较小的 Avenue 数据集，它设置为 32，对于较大的 SHTech 和 Corridor 数据集，它设置为 64。对于 $lf$ 和 $gf$ 代，使用了内核大小和 $(3,3)$ stride 为 1 的平均池化。在空间分区中，使用核大小 . $(4,4)$ 在高级分区中， $L$ TPP 设置为 2，在最大池化期间，使用步幅为 1 的内核大小 $(2)$ 。在内存优化期间，Avenue、SHTech 和 Corridor 数据集的核心集子采样率分别设置为 1% 、 25% 和 10% 。

Table 1: Comparison of VPC against SOTA methods in terms of AUROC. Avenue: CUHK Avenue, SHTech: ShanghaiTech, Corridor: IITB Corridor. The best results are bolded. The second-best results are underlined. *implemented by us.
表 1：VPC 与 SOTA 方法在 AUROC 方面的比较。地点：香港中文大学大道，上海科技大学：上海科技大学，走廊： IITB 走廊.最佳结果以粗体显示。次好的结果是带下划线的。*由我们实施。

Method 方法	Venue 场地	Memory 记忆	Avenue 大道	SHTech SHTech 公司	Corridor 走廊
FFP[18] 常数排序[18]	CVPR 18		84.9%	72.8%	64.7%
MPED-RNN[24]	CVPR 19		-	73.4%	64.3%
MemAE[8] 内存AE[8]	ICCV 19	✓	83.3%	71.2%	-
AMC[25]	ICCV 19		86.9%	-	-
MTP[29]	WACV 20		82.9%	76.0%	67.1%
CDDA[5]	ECCV 20 ECCV 20 系列		86.0%	73.3%	-
MNAD[26]	CVPR 20 CVPR 20 号	✓	88.5%	70.5%	-
ROADMAP[34] 路线图[34]	TNNLS 21		88.3%	76.6%	-
AMMC-Net[3]	AAAI 21	✓	86.6%	73.7%	-
MPN[23]	CVPR 21 CVPR 21 号	✓	89.5%	73.8%	-
$\text{HF}^{2}\text{-VAD}$ [20]	ICCV 21	✓	91.1%	76.2%	-
LLSH[22]	TCSVT 22		87.4%	77.6%	73.5%
VABD[17]	TIP 22 提示 22		86.6%	78.2%	72.2%
DLAN-AC[37]	ECCV 22	✓	89.9%	74.7%	-
Jigsaw[33] 拼图[33]	ECCV 22		92.2%	84.3%	-
Sun et al.[31] Sun 等人[31]	AAAI 23	✓	91.5%	78.6%	-
Cao et al.[4] Cao 等人[4]	CVPR 23 CVPR 23 号		86.8%	79.2%	73.6%
USTN-DSC[36]	CVPR 23 CVPR 23 号		89.9%	73.8%	-
DMAD[19]	CVPR 23 CVPR 23 号	✓	92.8%	78.8%	-
FPDM[35] FPDM系列[35]	ICCV 23		90.1%	78.6%	-
$\text{STG-NF}^{*}$ [11]	ICCV 23		61.8%	85.9%	61.4%
HSC[32]	ICCV 23	✓	92.4%	83.0%	-
Ristea et al.[28] Ristea 等人[28]	CVPR 24 CVPR 24 号		91.3%	79.1%	-
Zhang et al.[39] Zhang 等人[39]	CVPR 24 CVPR 24 号		92.4%	85.1%	-
VPC (Ours) VPC（我们的）	-	✓	92.8%	85.1%	76.4%

Inference details. The anomaly score was smoothed using a 1D Gaussian filter during testing. The parameters $\delta_{1}$ and $\delta_{2}$ used in LAS computation were set to $(0.7,0.3)$ for Avenue and $(0.5,0.5)$ for the SHTech and Corridor datasets. The parameters $\gamma_{1}$ and $\gamma_{2}$ used in the final anomaly score computation were set to $(0.7,0.3)$ for Avenue and $(0.9,0.1)$ for the SHTech and Corridor datasets. The experiments were conducted using an NVIDIA GeForce RTX 3090 graphics card.
推理详细信息。在测试期间，使用 1D Gaussian 滤波器对异常分数进行平滑处理。LAS 计算中使用的参数 $\delta_{1}$ $\delta_{2}$ 设置为 $(0.7,0.3)$ Avenue 以及 $(0.5,0.5)$ SHTech 和 Corridor 数据集。最终异常分数计算中使用的参数 $\gamma_{1}$ 和 $\gamma_{2}$ 设置为 $(0.7,0.3)$ Avenue 以及 $(0.9,0.1)$ SHTech 和 Corridor 数据集。实验是使用 NVIDIA GeForce RTX 3090 显卡进行的。

4.3 Evaluation Criteria
4.3 评估标准

Consistent with prior research, the frame-level area under the curve (AUC) of Receiver Operation Characteristic (ROC) was used to evaluate the performance of the proposed model. This evaluation was calculated as the area under the ROC curve by varying the threshold. Specifically, we aggregated anomaly scores across all frames in the dataset to calculate the micro-averaged AUROC.
与之前的研究一致，使用接收者操作特性（ROC）的帧级曲线下面积（AUC）来评估所提出的模型的性能。该评估是通过改变阈值计算为 ROC 曲线下的面积。具体来说，我们汇总了数据集中所有帧的异常分数，以计算微平均 AUROC。

4.4 Comparison with SOTA models
4.4 与 SOTA 模型的比较

Tab. 1 presents the results of the proposed model’s performance on three benchmark datasets. It demonstrates the competitive performance compared to other state-of-the-art (SOTA) methods. Specifically, when compared to HSC[32], which utilizes appearance and motion memory, our approach outperforms by 0.4% and 2.1% on the Avenue and SHTech datasets, respectively, leveraging three memory components effectively. And, compared to FPDM[35], which also employs the CLIP, our model achieves better performance by accurately detecting anomaly events at the local stream. Furthermore, compared to the other methods using memory, a superior performance is achieved. This demonstrates that a method focused on memory optimization is more effective than traditional methods using memory. In addition, individuals without deep learning expertise can easily use the proposed approach because it does not require training.
表 1 显示了所提出的模型在三个基准数据集上的性能结果。它展示了与其他最先进的（SOTA）方法相比的竞争性能。具体来说，与利用外观和运动记忆的 HSC[32] 相比，我们的方法在 Avenue 和 SHTech 数据集上分别高出 0.4% 和 2.1%，有效地利用了三个内存组件。而且，与同样采用 CLIP 的 FPDM[35] 相比，我们的模型通过准确检测本地流中的异常事件实现了更好的性能。此外，与其他使用内存的方法相比，实现了卓越的性能。这表明，专注于内存优化的方法比使用内存的传统方法更有效。此外，没有深度学习专业知识的个人可以轻松使用所提出的方法，因为它不需要培训。

4.5 Qualitative Evaluation
4,5 定性评价

To intuitively understand the use of multiple memories, for the Avenue dataset, we compared PatchCore (PC), which utilizes appearance information, with
为了直观地了解多个记忆的用途，对于 Avenue 数据集，我们将利用外观信息的 PatchCore （PC）与
VideoPatchCore (VPC), which incorporates not only appearance information, but also motion and high-level information. Here, PC refers to the use of spatial memory only in VPC. Fig. 4(a) depicts an anomalous frame showing the action of throwing papers, necessitating consideration of motion information. Therefore, temporal memory plays a crucial role in this scenario. Fig. 4(b) depicts an anomalous frame showing the action of walking in the incorrect direction, necessitating consideration of both object and scene contexts. Therefore, high-level semantic memory plays a crucial role in this scenario. It is known that VPC, which utilizes temporal and high-level semantic memory, detects anomalies well, whereas PC, which does not utilize them, does not detect anomaly. Additional experiments on the SHTech and Corridor dataset are presented in the supplementary materials.
VideoPatchCore （VPC），它不仅包含外观信息，还包含运动和高级信息。此处，PC 仅指在 VPC 中使用空间内存。无花果。图4（a）描绘了一个异常的画面，显示了投掷纸张的动作，需要考虑运动信息。因此，时态记忆在这种情况下起着至关重要的作用。无花果。 4（b）描绘了一个异常帧，显示了向错误方向行走的动作，需要考虑物体和场景上下文。因此，高级语义记忆在这种情况下起着至关重要的作用。众所周知，利用时间和高级语义内存的 VPC 可以很好地检测异常，而不利用它们的 PC 不会检测到异常。补充材料中介绍了 SHTech 和 Corridor 数据集的其他实验。

4.6 Ablation study
4,6 消融研究

Table 2: Comparison of the AUROC scores for the spatial and temporal memory on the Avenue, SHTech and Corridor datasets.
表 2：Avenue、SHTech 和 Corridor 数据集上空间和时间记忆的 AUROC 分数比较。

Memory 记忆	Avenue 大道	SHTech SHTech 公司	Corridor 走廊
Spatial 空间	84.8%	74.7%	70.5%
Temporal 时间的	66.9%	78.8%	73.5%
Spatial+Temporal 空间 + 时间	90.3%	84.8%	76.3%

Table 3: Comparison of the AUROC scores for compression methods on the Avenue, SHTech and Corridor datasets.
表 3：Avenue、SHTech 和 Corridor 数据集上压缩方法的 AUROC 分数比较。

Compression 压缩	Avenue 大道	SHTech SHTech 公司	Corridor 走廊
Head 头	82.7%	76.9%	74.9%
Random 随机	85.9%	83.8%	74.8%
Split Pool 拆分池	90.3%	84.8%	76.3%

Spatial and Temporal memory. Tab. 2 presents a comparison of the VAD performance affected by the spatial and temporal memory. The spatial memory effectively detects the appearance of the anomaly object, while temporal memory effectively detects the action anomaly event. As it is important to consider both the motion and appearance information for accurate anomaly detection, using both types of memory helps achieve high performance on the Avenue, SHTech, and Corridor datasets.
空间和时间记忆。表 2 显示了受空间和时间记忆影响的 VAD 性能的比较。空间记忆有效地检测异常对象的外观，而时态记忆有效地检测动作异常事件。由于要同时考虑运动和外观信息以进行准确的异常检测非常重要，因此使用这两种类型的内存有助于在 Avenue、SHTech 和 Corridor 数据集上实现高性能。

Split pooling. Tab. 3 presents the AUROC scores based on information compression techniques in the local stream. Head compresses $lf$ by selecting $c^{\prime}$ channels from the front. Random compresses $lf$ by selecting random $c^{\prime}$ channels. Split pool divides $lf$ into $c^{\prime}$ groups and compresses by averaging the channels of each group. The common compression methods lose performance as they fail to preserve the original information, whereas the proposed compression method achieves high performance by effectively compressing the original information.
拆分池。表 3 显示了基于本地流中信息压缩技术的 AUROC 分数。通过从前面选择 $c^{\prime}$ 通道来压缩 $lf$ 头部。通过选择随机 $c^{\prime}$ 通道进行随机压缩 $lf$ 。Split pool （拆分池） $lf$ 划分为 $c^{\prime}$ 多个组，并通过平均每组的通道进行压缩。常见的压缩方法由于无法保留原始信息而失去性能，而所提出的压缩方法通过有效压缩原始信息来实现高性能。

Mismatched pooling. Tab. 4 presents the AUROC scores based on feature pooling methods in the temporal partition. The avg and max pool perform GAP and GMP during the memorization and inference stage, respectively. The common pooling methods fail to effectively detect action anomalies, whereas the proposed method achieves effective detection. This demonstrates that differentiating between normal and abnormal temporal patches by varying the pooling methods at each stage is effective for VAD.
池不匹配。表 4 显示了基于时间分区中特征池方法的 AUROC 分数。avg 和 max 池分别在记忆和推理阶段执行 GAP 和 GMP。常见的池化方法无法有效检测动作异常，而所提方法实现了有效检测。这表明，通过改变每个阶段的池化方法来区分正常和异常的颞部斑块对 VAD 是有效的。

Table 4: Comparison of AUROC scores for feature pooling methods in the temporal partition across the Avenue, SHTech and Corridor datasets.
表 4：Avenue、SHTech 和 Corridor 数据集的时间分区中特征池方法的 AUROC 分数比较。

Pooling 池	Avenue 大道	SHTech SHTech 公司	Corridor 走廊
Avg Pool 平均池	84.9%	76.1%	74.7%
Max Pool 最大池	88.2%	83.8%	75.7%
Mismatched Pool 不匹配的池	90.3%	84.8%	76.3%

Table 5: Comparison of the AUROC scores for the local and global stream on the Avenue, SHTech and Corridor datasets.
表 5：Avenue、SHTech 和 Corridor 数据集上本地和全球流的 AUROC 分数比较。

Stream 流	Avenue 大道	SHTech SHTech 公司	Corridor 走廊
Local 当地	90.3%	84.8%	76.3%
Global 全球	84.4%	68.4%	67.2%
Local+Global 本地 + 全球	92.8%	85.1%	76.4%

Local and Global stream. Tab. 5 presents a comparison of the anomaly detection performance affected by the local and global stream. When using only the local stream, it is difficult to detect anomalies based on scenes and abnormal behaviors among multiple objects. To address this issue, we incorporated the global stream, which utilizes a wide range of spatial-temporal information. When adding the global stream to Avenue, there is a significant performance improvement, whereas adding the global stream to SHTech and Corridor results in relatively slight performance improvements. In the latter datasets, there are many situations where objects are adjacent to each other, allowing the local stream alone to partially fulfill the role of the global stream.
本地和全球流。表 5 显示了受本地和全局流影响的异常检测性能的比较。当仅使用本地流时，很难根据场景和多个对象之间的异常行为进行检测。为了解决这个问题，我们合并了 global stream，它利用了广泛的时空信息。将全局流添加到 Avenue 时，性能会得到显著提升，而将全局流添加到 SHTech 和 Corridor 后，性能提升相对较小。在后一个数据集中，有很多情况是对象彼此相邻，允许单独的本地流部分履行全局流的角色。

4.7 Further Analysis
4,7 进一步分析

Coreset subsampling ratio. Tab. 6 presents the AUROC scores based on the subsampling ratio. Generally, increasing the sampling ratio in large datasets stores a diverse range of normal features in memory, leading to improved performance, but it can also result in slower processing speed. However, using Coreset subsampling, the performance difference in memory usage between 1% and 99% is very minimal, less than 0.5%, and the frames per second (FPS) can be up to 7.3 times faster. This demonstrates the effectiveness of coreset subsampling in large datasets such as SHTech and Corridor.
核心集子采样率。表 6 显示了基于子采样率的 AUROC 分数。通常，在大型数据集中增加采样率会在内存中存储各种正常特征，从而提高性能，但也可能导致处理速度变慢。但是，使用 Coreset 子采样时，1% 和 99% 之间的内存使用性能差异非常小，小于 0.5%，并且每秒帧数（FPS）可以提高 7.3 倍。这证明了 SHTech 和 Corridor 等大型数据集中核心集子采样的有效性。

Memorizing Technique. Tab. 7 presents the AUROC scores based on the subsampling order. The former method constructs memory from each video and concatenates them, whereas the latter method constructs memory from all videos collectively. We expected that the former method, which optimizes memory for each video, would be less constrained by hardware limitations but might suffer from a lower performance due to the uneven distribution of features. However, in fact, the performance difference between the two methods is almost nonexistent. This suggests that it is possible to sufficiently store a diverse range of normal features regardless of the order of subsampling.
记忆技巧。表 7 显示了基于子采样顺序的 AUROC 分数。前一种方法从每个视频中构建内存并将它们连接起来，而后一种方法从所有视频中集体构建内存。我们预计前一种方法（优化每个视频的内存）受硬件限制的限制较小，但由于功能分布不均匀，性能可能会降低。但实际上，两种方法之间的性能差异几乎不存在。这表明，无论子采样的顺序如何，都可以充分存储各种法向特征。

Table 6: Comparison of AUROC scores for the subsampling ratio on the SHTech and Corridor datasets.
表 6：SHTech 和 Corridor 数据集上子抽样率的 AUROC 分数比较。

Subsampling ratio 子采样率

SHTech SHTech 公司

Corridor 走廊

AUC

FPS

AUC

FPS

10%

25%

50%

75%

99%

84.6% (-0.5%)

85.0% (-0.1%)

85.1% (-0.0%)

85.1%

170.9

154.8

96.1

54.8

39.6

31.0

76.0% (-0.4%)

76.4% (-0.0%)

76.3% (-0.1%)

76.4%

143.4

113.5

60.0

40.0

25.2

19.5

Table 7: Comparison of AUROC scores for the memorizing technique at memory usage levels of both 10% and 99% on the SHTech and Corridor datasets.
表 7：在 SHTech 和 Corridor 数据集上，记忆技术在 10% 和 99% 的内存使用水平下的 AUROC 分数比较。

Memorizing technique 记忆技术	Ratio 率	SHTech SHTech 公司	Corridor 走廊
Subsampling 子采样	10%	85.0%	76.4%
$\rightarrow\text{concat}$	99%	85.1%	76.4%
Concat 连接	10%	84.9% (-0.1%)	76.3% (-0.1%)
$\rightarrow\text{subsampling}$	99%	85.0% (-0.1%)	76.4% (-0.0%)

5 Conclusion 5 结论

We present a VPC that extends PatchCore to the video level. The proposed model consists of three types of memory considering the characteristics of the video, and maintains high AUC and FPS through memory optimization using coreset subsampling. This approach can be used as an effective memory in VAD by solving three problems of traditional memory methods (increased optimization difficulty, complexity of implementation, and performance variability depending on the memory size). In addition, VPC shows excellent performance in all benchmark datasets and does not require training, increasing accessibility in the VAD field. We expect VPC to make a significant contribution to the VAD field.
我们提出了一个将 PatchCore 扩展到视频级别的 VPC。所提出的模型由三个考虑视频特性的内存类型，并通过保持高 AUC 和 FPS 使用 CoreSet Subsampling 的内存优化。这种方法可以用作有效的通过解决传统内存方法的三个问题（增加优化难度、实现复杂性和性能变化取决于内存大小）。此外，VPC 在所有基准测试数据集中都表现出优异的性能，并且不需要培训，增加 VAD 领域的可及性。我们预计 VPC 将做出重大贡献添加到 VAD 字段。

Acknowledgements. This research was supported by the National Research Foundation (NRF) funded by the Korean government (MSIT) (No. RS-2023-00229822). The funders did not play any role in the design of the study, data collection, analysis, or preparation of the manuscript.
确认。这项研究得到了韩国政府（MSIT）资助的国家研究基金会（NRF）的支持（No.RS-2023-00229822）。资助者在研究设计、数据收集、分析或手稿准备方面没有发挥任何作用。

References 引用

[1] Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Geometric approximation via coresets (2007)
Agarwal， P.K.， Har-Peled， S.， Varadarajan， K.R.：通过核心集进行几何近似（2007）
[2] Astrid, M., Zaheer, M., Lee, J.Y., Lee, S.I.: Learning not to reconstruct anomalies. In: British Machine Vision Conference (2021)
Astrid， M.， Zaheer， M.， Lee， JY， Lee， S.I.：学会不重建异常。收录于：英国机器视觉会议（2021）
[3] Cai, R., Zhang, H., Liu, W., Gao, S., Hao, Z.: Appearance-motion memory consistency network for video anomaly detection. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 938–946 (2021)
Cai， R.， Zhang， H.， Liu， W.， Gao， S.， Hao， Z.：用于视频异常检测的外观-运动记忆一致性网络。收录于：人工智能 AAAI 会议论文集。第 35 卷，第 938–946 页（2021 年）
[4] Cao, C., Lu, Y., Wang, P., Zhang, Y.: A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20392–20401 (2023)
Cao， C.， Lu， Y.， Wang， P.， Zhang， Y.：半监督视频异常检测和预测的新综合基准。收录于：计算机视觉和模式识别 IEEE/CVF 会议论文集。页码： 20392–20401 （2023）
[5] Chang, Y., Tu, Z., Xie, W., Yuan, J.: Clustering driven deep autoencoder for video anomaly detection. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16. pp. 329–345. Springer (2020)
Chang， Y.， Tu， Z.， Xie， W.， Yuan， J.：用于视频异常检测的聚类驱动的深度自动编码器。在：计算机视觉-ECCV 2020：第 16 届欧洲会议，英国格拉斯哥，2020 年 8 月 23 日至 28 日，会议记录，第 XV 部分 16。第 329-345 页。施普林格（2020）
[6] Cohen, N., Hoshen, Y.: Sub-image anomaly detection with deep pyramid correspondences. ArXiv abs/2005.02357 (2020)
Cohen， N.， Hoshen， Y.：具有深金字塔对应关系的子图像异常检测。ArXiv abs/2005.02357 （2020）
[7] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009)
邓，J.，董，W.，索彻，R.，李，L.J.，李，K.，飞飞，L.：Imagenet：大规模分层图像数据库。收录于：2009 年 IEEE 计算机视觉和模式识别会议。第 248-255 页。IEEE认证（2009）
[8] Gong, D., Liu, L., Le, V., Saha, B., Mansour, M.R., Venkatesh, S., Hengel, A.v.d.: Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1705–1714 (2019)
Gong， D.， Liu， L.， Le， V.， Saha， B.， Mansour， M.R.， Venkatesh， S.， Hengel， AVD：记忆正常性以检测异常：用于无监督异常检测的记忆增强深度自动编码器。收录于：IEEE/CVF 计算机视觉国际会议论文集。第 1705–1714 页（2019）
[9] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Advances in neural information processing systems 27 (2014)
Goodfellow，I.，Pouget-Abadie，J.，Mirza，M.，Xu，B.，Warde-Farley，D.，Ozair，S.，Courville，A.，Bengio，Y.：生成对抗网络。神经信息处理系统的进展 27 （2014）
[10] Hasan, M., Choi, J., Neumann, J., Roy-Chowdhury, A.K., Davis, L.S.: Learning temporal regularity in video sequences. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 733–742 (2016)
Hasan， M.， Choi， J.， Neumann， J.， Roy-Chowdhury， AK， Davis， LS：学习视频序列中的时间规律。收录于：IEEE 计算机视觉和模式识别会议论文集。第 733–742 页（2016）
[11] Hirschorn, O., Avidan, S.: Normalizing flows for human pose anomaly detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13545–13554 (2023)
Hirschorn， O.， Avidan， S.：用于人类姿势异常检测的标准化流程。收录于：IEEE/CVF 计算机视觉国际会议论文集。第 13545–13554 页（2023）
[12] Hong, S., Ahn, S., Jo, Y., Park, S.: Making anomalies more anomalous: Video anomaly detection using a novel generator and destroyer. IEEE Access (2024)
Hong， S.， Ahn， S.， Jo， Y.， Park， S.：使异常更加异常：使用新型发生器和破坏器进行视频异常检测。IEEE Access （2024 年）
[13] Huang, C., Wen, J., Xu, Y., Jiang, Q., Yang, J., Wang, Y., Zhang, D.: Self-supervised attentive generative adversarial networks for video anomaly detection. IEEE transactions on neural networks and learning systems (2022)
黄婉，温， J.，徐彦，江， Q.，杨倩，王倍，张倩：用于视频异常检测的自我监督注意力生成对抗网络。IEEE Transactions on Neural Networks and Learning Systems （2022）（英语）
[14] Huang, C., Wu, Z., Wen, J., Xu, Y., Jiang, Q., Wang, Y.: Abnormal event detection using deep contrastive learning for intelligent video surveillance system. IEEE Transactions on Industrial Informatics 18(8), 5171–5179 (2021)
黄婉，吴， Z.，温， J.，徐， Y.，江， Q.，王Y.：使用深度对比学习进行智能视频监控系统的异常事件检测.IEEE 工业信息学汇刊 18（8）， 5171–5179 （2021）
[15] Jocher, G., Chaurasia, A., Stoken, A., Borovec, J., Kwon, Y., Michael, K., Fang, J., Yifu, Z., Wong, C., Montes, D., et al.: ultralytics/yolov5: v7. 0-yolov5 sota realtime instance segmentation. Zenodo (2022)
Jocher，G.，Chaurasia，A.，Stoken，A.，Borovec，J.，Kwon，Y.，Michael，K.，Fang，J.，Yifu，Z.，Wong，C.，Montes，D.等：ultralytics/yolov5：v7。0-yolov5 SOTA 实时实例分割。芝诺堂（2022）
[16] Kwon, H., Kwak, S., Cho, M.: Video understanding via convolutional temporal pooling network and multimodal feature fusion. In: Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild. pp. 35–39 (2018)
Kwon， H.， Kwak， S.， Cho， M.：通过卷积时间池化网络和多模态特征融合进行视频理解。在：第 1 届研讨会的会议记录和野外综合视频理解挑战。第 35–39 页（2018）
[17] Li, J., Huang, Q., Du, Y., Zhen, X., Chen, S., Shao, L.: Variational abnormal behavior detection with motion consistency. IEEE Transactions on Image Processing 31, 275–286 (2021)
Li， J.， Huang， Q.， Du， Y.， Zhen， X.， Chen， S.， Shao， L.：具有运动一致性的变分异常行为检测。IEEE 图像处理汇刊 31， 275–286 （2021）
[18] Liu, W., Luo, W., Lian, D., Gao, S.: Future frame prediction for anomaly detection–a new baseline. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6536–6545 (2018)
Liu， W.， Luo， W.， Lian， D.， Gao， S.：异常检测的未来帧预测——新基线。收录于：IEEE 计算机视觉和模式识别会议论文集。第 6536–6545 页（2018）
[19] Liu, W., Chang, H., Ma, B., Shan, S., Chen, X.: Diversity-measurable anomaly detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12147–12156 (2023)
刘文，张， H.，马， B.，单， S.，陈， X.：多样性可测量异常检测。收录于：计算机视觉和模式识别 IEEE/CVF 会议论文集。页码： 12147–12156 （2023）
[20] Liu, Z., Nie, Y., Long, C., Zhang, Q., Li, G.: A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 13588–13597 (2021)
Liu， Z.， Nie， Y.， Long， C.， Zhang， Q.， Li， G.：通过记忆增强流重建和流导帧预测的混合视频异常检测框架。收录于：IEEE/CVF 计算机视觉国际会议论文集。第 13588–13597 页（2021）
[21] Lu, C., Shi, J., Jia, J.: Abnormal event detection at 150 fps in matlab. In: Proceedings of the IEEE international conference on computer vision. pp. 2720–2727 (2013)
Lu， C.， Shi， J.， Jia， J.：在 MATLAB 中以 150 fps 的速度进行异常事件检测。收录于：IEEE 计算机视觉国际会议论文集。第 2720–2727 页（2013）
[22] Lu, Y., Cao, C., Zhang, Y., Zhang, Y.: Learnable locality-sensitive hashing for video anomaly detection. IEEE Transactions on Circuits and Systems for Video Technology 33(2), 963–976 (2022)
Lu， Y.， Cao， C.， Zhang， Y.， Zhang， Y.：用于视频异常检测的可学习位置敏感哈希。IEEE 视频技术电路和系统汇刊 33（2）， 963–976 （2022）
[23] Lv, H., Chen, C., Cui, Z., Xu, C., Li, Y., Yang, J.: Learning normal dynamics in videos with meta prototype network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15425–15434 (2021)
Lv， H.， Chen， C.， Cui， Z.， Xu， C.， Li， Y.， Yang， J.：使用元原型网络学习视频中的正常动态。收录于：计算机视觉和模式识别 IEEE/CVF 会议论文集。页码： 15425–15434 （2021）
[24] Morais, R., Le, V., Tran, T., Saha, B., Mansour, M., Venkatesh, S.: Learning regularity in skeleton trajectories for anomaly detection in videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11996–12004 (2019)
Morais， R.， Le， V.， Tran， T.， Saha， B.， Mansour， M.， Venkatesh， S.：学习骨骼轨迹的规律性以检测视频中的异常。收录于：计算机视觉和模式识别 IEEE/CVF 会议论文集。第 11996–12004 页（2019）
[25] Nguyen, T.N., Meunier, J.: Anomaly detection in video sequence with appearance-motion correspondence. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1273–1283 (2019)
Nguyen， T.N.， Meunier， J.：具有外观-运动对应关系的视频序列中的异常检测。收录于：IEEE/CVF 计算机视觉国际会议论文集。第 1273–1283 页（2019）
[26] Park, H., Noh, J., Ham, B.: Learning memory-guided normality for anomaly detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14372–14381 (2020)
Park， H.， Noh， J.， Ham， B.：学习记忆引导的常态性以进行异常检测。收录于：计算机视觉和模式识别 IEEE/CVF 会议论文集。第 14372–14381 页（2020）
[27] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (2021)
Radford， A.， Kim， JW， Hallacy， C.， Ramesh， A.， Goh， G.， Agarwal， S.， Sastry， G.， Askell， A.， Mishkin， P.， Clark， J.， Krueger， G.， Sutskever， I.：从自然语言监督中学习可转移的视觉模型。在：机器学习国际会议（2021）
[28] Ristea, N.C., Croitoru, F.A., Ionescu, R.T., Popescu, M., Khan, F.S., Shah, M., et al.: Self-distilled masked auto-encoders are efficient video anomaly detectors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15984–15995 (2024)
Ristea， N.C.， Croitoru， F.A.， Ionescu， R.T.， Popescu， M.， Khan， F.S.， Shah， M.， et al.：自蒸馏的掩蔽自动编码器是高效的视频异常检测器。收录于：IEEE/CVF 计算机视觉和模式识别会议论文集。第 15984–15995 页（2024）
[29] Rodrigues, R., Bhargava, N., Velmurugan, R., Chaudhuri, S.: Multi-timescale trajectory prediction for abnormal human activity detection. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 2626–2634 (2020)
Rodrigues， R.， Bhargava， N.， Velmurugan， R.， Chaudhuri， S.：用于异常人类活动检测的多时间尺度轨迹预测。收录于：IEEE/CVF 计算机视觉应用冬季会议论文集。第 2626–2634 页（2020）
[30] Roth, K., Pemula, L., Zepeda, J., Scholkopf, B., Brox, T., Gehler, P.: Towards total recall in industrial anomaly detection. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 14298–14308 (2021)
Roth， K.， Pemula， L.， Zepeda， J.， Scholkopf， B.， Brox， T.， Gehler， P.：实现工业异常检测中的总召回。2022 年 IEEE/CVF 计算机视觉和模式识别会议（CVPR），第 14298–14308 页（2021 年）
[31] Sun, C., Shi, C., Jia, Y., Wu, Y.: Learning event-relevant factors for video anomaly detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 2384–2392 (2023)
Sun， C.， Shi， C.， Jia， Y.， Wu， Y.：学习视频异常检测的事件相关因素。收录于：AAAI 人工智能会议论文集。第 37 卷，第 2384–2392 页（2023 年）
[32] Sun, S., Gong, X.: Hierarchical semantic contrast for scene-aware video anomaly detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22846–22856 (2023)
Sun， S.， Gong， X.：用于场景感知视频异常检测的分层语义对比。收录于：计算机视觉和模式识别 IEEE/CVF 会议论文集。页码： 22846–22856 （2023）
[33] Wang, G., Wang, Y., Qin, J., Zhang, D., Bao, X., Huang, D.: Video anomaly detection by solving decoupled spatio-temporal jigsaw puzzles. In: European Conference on Computer Vision. pp. 494–511. Springer (2022)
Wang， G.， Wang， Y.， Qin， J.， Zhang， D.， Bao， X.， Huang， D.：通过解决解耦时空拼图游戏进行视频异常检测。在：欧洲计算机视觉会议。第 494-511 页。施普林格（2022）
[34] Wang, X., Che, Z., Jiang, B., Xiao, N., Yang, K., Tang, J., Ye, J., Wang, J., Qi, Q.: Robust unsupervised video anomaly detection by multipath frame prediction. IEEE transactions on neural networks and learning systems 33(6), 2301–2312 (2021)
王晓晓，车， Z.，江斌，肖， N.，杨， K.，唐， J.，叶， J.，王， J.，齐， Q.：通过多路径帧预测实现稳健的无监督视频异常检测。IEEE 神经网络和学习系统汇刊 33（6）， 2301–2312 （2021）
[35] Yan, C., Zhang, S., Liu, Y., Pang, G., Wang, W.: Feature prediction diffusion model for video anomaly detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5527–5537 (2023)
Yan， C.， Zhang， S.， Liu， Y.， Pang， G.， Wang， W.：用于视频异常检测的特征预测扩散模型。收录于：IEEE/CVF 计算机视觉国际会议论文集。第 5527–5537 页（2023）
[36] Yang, Z., Liu, J., Wu, Z., Wu, P., Liu, X.: Video event restoration based on keyframes for video anomaly detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14592–14601 (2023)
Yang， Z.， Liu， J.， Wu， Z.， Wu， P.， Liu， X.：基于关键帧的视频事件恢复，用于视频异常检测。收录于：IEEE/CVF 计算机视觉和模式识别会议论文集。第 14592–14601 页（2023）
[37] Yang, Z., Wu, P., Liu, J., Liu, X.: Dynamic local aggregation network with adaptive clusterer for anomaly detection. In: European Conference on Computer Vision. pp. 404–421. Springer (2022)
Yang， Z.， Wu， P.， Liu， J.， Liu， X.：带有自适应集群器的动态本地聚合网络，用于异常检测。在：欧洲计算机视觉会议。第 404-421 页。施普林格（2022）
[38] Zaheer, M.Z., Lee, J.h., Astrid, M., Lee, S.I.: Old is gold: Redefining the adversarially learned one-class classifier training paradigm. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14183–14193 (2020)
Zaheer， M.Z.， Lee， J.h.， Astrid， M.， Lee， S.I.：老即金：重新定义对抗性学习的单类分类器训练范式。收录于：IEEE/CVF 计算机视觉和模式识别会议论文集。第 14183–14193 页（2020）
[39] Zhang, M., Wang, J., Qi, Q., Sun, H., Zhuang, Z., Ren, P., Ma, R., Liao, J.: Multi-scale video anomaly detection by multi-grained spatio-temporal representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17385–17394 (2024)
张明明，王建军，齐建群，孙海，庄志强，任平，马荣，廖俊：通过多粒度时空表示学习进行多尺度视频异常检测。收录于：IEEE/CVF 计算机视觉和模式识别会议论文集。第 17385–17394 页（2024）
[40] Zhang, Y., Zhou, D., Chen, S., Gao, S., Ma, Y.: Single-image crowd counting via multi-column convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 589–597 (2016)
张彦彦，周卿，陈倩，高倩，马倩們：通过多列卷积神经网络进行单图像人群计数。收录于：IEEE 计算机视觉和模式识别会议论文集。第 589–597 页（2016）
[41] Zhong, Y., Chen, X., Jiang, J., Ren, F.: A cascade reconstruction model with generalization ability evaluation for anomaly detection in videos. Pattern Recognition 122, 108336 (2022)
钟彦，陈晓，江，任， F.：一种具有泛化能力评估的级联重建模型，用于视频中的异常检测。模式识别 122， 108336 （2022）

VideoPatchCore: An Effective Method to Memorize Normality for Video Anomaly Detection (Supplementary Materials)
VideoPatchCore：一种用于视频异常检测的正常性记忆的有效方法（补充材料）

A Qualitative Evaluation
一个定性评价

A.1 Anomaly Score Visualization
答 1 异常评分可视化

To intuitively understand the use of multiple memories, we compared PatchCore (PC), which utilizes appearance information, with VideoPatchCore (VPC), which incorporates not only appearance information, but also motion and high-level information, on the SHTech and Corridor datasets.
为了直观地了解多个记忆的使用，我们在 SHTech 和 Corridor 数据集上比较了利用外观信息的 PatchCore （PC）和 VideoPatchCore （VPC），后者不仅包含外观信息，还包含运动和高级信息。

Fig. 1(a) shows anomaly frames depicting the action of skateboarding, necessitating consideration of motion information. Therefore, temporal memory plays a crucial role in this scenario. Fig. 1(b) shows anomaly frames depicting the action of fighting, necessitating consideration of interactions between two peoples. Therefore, high-level semantic memory plays a crucial role in this scenario. It is known that VPC, which utilizes temporal and high-level semantic memory, detects anomalies well, whereas PC, which does not utilize them, does not detect anomalies.
无花果。 1（a）显示了描绘滑板动作的异常帧，需要考虑运动信息。因此，时态记忆在这种情况下起着至关重要的作用。无花果。 1（b）显示了描绘战斗动作的异常帧，需要考虑两个民族之间的互动。因此，高级语义记忆在这种情况下起着至关重要的作用。众所周知，利用时态和高级语义内存的 VPC 可以很好地检测异常，而不利用它们的 PC 不会检测到异常。

Fig. 2(a) shows anomaly frames depicting the action of sudden running, necessitating consideration of motion information. Therefore, temporal memory plays a crucial role in this scenario. Fig. 2(b) shows anomaly frames depicting the action of moving a suspicious object, necessitating consideration of relationship between the person and object. Therefore, high-level semantic memory plays a crucial role in this scenario. VPC, which employs high-level and temporal memory, effectively distinguishes between abnormal and normal frames. In contrast, PC tends to produce false positives.
无花果。图2（a）显示了描述突然奔跑动作的异常帧，需要考虑运动信息。因此，时态记忆在这种情况下起着至关重要的作用。无花果。图2（b）显示了异常帧，描绘了移动可疑物体的动作，需要考虑人与物体之间的关系。因此，高级语义记忆在这种情况下起着至关重要的作用。VPC 采用高级和临时内存，可有效区分异常帧和正常帧。相比之下，PC 往往会产生误报。

A.2 Object-wise Anomaly Score Visualization
答 2 对象级异常评分可视化

For a deeper understanding of memory effectiveness, we computed anomaly scores using each memory module for two specific objects shown in Fig. S3. One is the most anomalous object, and the other is the most normal object, as determined by the proposed model. Each object is characterized by the spatial (S) and temporal anomaly scores (T), while frames containing these objects are assigned a high-level anomaly score (H). The experimental results demonstrate precise prediction of anomalous objects within frames by the proposed model, with each memory module effectively fulfilling its role in various scenarios. For instance, in the bicycle scenario involving abnormal appearance, S increases due to spatial memory, while in the running scenario with abnormal behavior, T increases due to temporal memory. Finally, challenging anomalies such as "wrong direction" in the local stream are detected by H increasing in high-level memory. This validates the effectiveness of the three memory modules in VAD.
为了更深入地了解内存有效性，我们使用每个内存模块计算了图 1 中所示的两个特定对象的异常分数。 S3 中。一个是最异常的物体，另一个是最正常的物体，由提议的模型确定。每个对象都由空间（S）和时间异常分数（T）表征，而包含这些对象的帧则被分配一个高级异常分数（H）。实验结果表明，所提模型对帧内异常目标进行了精确预测，每个内存模块在各种场景中都有效地发挥了其作用。例如，在涉及异常外观的自行车场景中，S 因空间记忆而增加，而在行为异常的跑步场景中，T 因时态记忆而增加。最后，通过高级内存中的 H 增加来检测本地流中的 “错误方向” 等具有挑战性的异常。这验证了 VAD 中三个内存模块的有效性。

A.3 Patch Visualization
答 3 补丁可视化

Fig. S4 depicts the t-SNE plots of normal patches stored in memory (denoted by green ’*’), along with normal (blue ’o’) and anomalous patches (red ’o’) from the test data. The results show that normal patches are clustered closely together compared to anomalous patches, aligning closely with the distribution of memory patches. In contrast, anomalous patches exhibit a wider dispersion and tend to be farther away from the memory patches. These findings suggest that the proposed memory effectively stores the normalcy of videos, making it suitable for VAD.
无花果。 S4 描述了存储在内存中的正常补丁（用绿色“*”表示）以及测试数据中的正常（蓝色“o”）和异常补丁（红色“o”）的 t-SNE 图。结果表明，与异常补丁相比，正常补丁紧密聚集在一起，与内存补丁的分布密切相关。相比之下，异常补丁表现出更广泛的分散性，并且往往离记忆补丁更远。这些发现表明，所提出的内存有效地存储了视频的正常性，使其适用于VAD。

Table S1: Comparison of AUROC scores for all memory banks on the Avenue, SHTech and Corridor datasets. The best results are red and the second best results are blue.
表 S1：Avenue、SHTech 和 Corridor 数据集上所有内存库的 AUROC 分数比较。最佳结果是红色，第二好的结果是蓝色。

Avenue 大道	1%	10%	25%	50%	75%	99%
Spatial 空间	0.848	0.831	0.828	0.825	0.828	0.828
Temporal 时间的	0.669	0.669	0.669	0.669	0.669	0.669
High-level 高级	0.844	0.845	0.848	0.844	0.844	0.844
Total 总	0.928	0.918	0.914	0.912	0.912	0.912

SHTech SHTech 公司	1%	10%	25%	50%	75%	99%
Spatial 空间	0.748	0.744	0.747	0.747	0.746	0.746
Temporal 时间的	0.788	0.788	0.788	0.788	0.788	0.788
High-level 高级	0.671	0.675	0.684	0.673	0.673	0.674
Total 总	0.846	0.850	0.851	0.851	0.851	0.851

Corridor 走廊	1%	10%	25%	50%	75%	99%
Spatial 空间	0.690	0.705	0.705	0.705	0.706	0.705
Temporal 时间的	0.735	0.735	0.735	0.735	0.735	0.735
High-level 高级	0.664	0.672	0.673	0.674	0.675	0.660
Total 总	0.760	0.764	0.763	0.763	0.763	0.764

B Quantitative Evaluation
乙定量评估

B.1 Detailed Analysis of Subsampling Ratio
B.1 子抽样比例详细分析

We conducted a detailed analysis of performance changes based on subsampling ratios in the Avenue, Shanghai, and Corridor datasets as shown in Tab. S1. In the SHTech and Corridor datasets, the performance was better when using more memory, whereas in the Avenue dataset, the performance tended to be higher with less memory.
我们根据 Avenue、Shanghai 和 Corridor 数据集中的子采样率对性能变化进行了详细分析，如表 S1 所示。在 SHTech 和 Corridor 数据集中，使用更多内存时性能更好，而在 Avenue 数据集中，内存较少时性能往往更高。

These results can be explained with two reasons. First, due to the diversity of normal data in the SHTech and Corridor datasets, the performance improves as more normal patches are stored in memory. Second, the Avenue dataset mainly consists of action anomalies, making it challenging to distinguish them from normal instances without using temporal information. Therefore, from a spatial memory perspective, filtering out normal patches that resemble anomalies enhances the performance. Meanwhile, other memories that utilize temporal information exhibit robust performance regardless of their size. Consequently, using only 1% of memory overall yields the best performance.
这些结果可以用两个原因来解释。首先，由于 SHTech 和 Corridor 数据集中法线数据的多样性，随着内存中存储的法线图块越多，性能也会提高。其次，Avenue 数据集主要由动作异常组成，这使得在不使用时间信息的情况下将它们与正常实例区分开来具有挑战性。因此，从空间内存的角度来看，过滤掉类似于异常的正常补丁可以提高性能。同时，其他利用时间信息的内存无论大小如何，都表现出强大的性能。因此，总体上仅使用 1% 的内存即可产生最佳性能。

However, as evidenced by the experimental results, the performance difference between using 10% and 99% of memory is very small. Therefore, in practical use, sufficiently good performance can be maintained even with memory usage set at 10% or lower.
然而，实验结果证明，使用 10% 和 99% 的内存之间的性能差异非常小。因此，在实际使用中，即使内存使用率设置为 10% 或更低，也可以保持足够好的性能。

VideoPatchCore: An Effective Method to Memorize Normality for Video Anomaly DetectionVideoPatchCore：一种用于视频异常检测的正常性的有效方法

Abstract 抽象

Keywords:

关键字：

1 Introduction 1 介绍

2 Related Work 阿拉伯数字 相关工作

2.1 Reconstruction & prediction-based Video Anomaly Detection2.1 基于重建和预测的视频异常检测

2.2 Video Anomaly Detection using Memory2.2 元使用内存进行视频异常检测

2.3 Representation-based Image Anomaly Detection2.3 基于表示的图像异常检测

3 Method 3 方法

3.1 Overview 3.1 概述

3.2 Patch Partition3.2 补丁分区

3.3 Inference 3.3 推理

4 Experiments 4 实验

4.1 Datasets 4.1 数据

4.2 Implementation details4.2 实现细节

4.3 Evaluation Criteria4.3 评估标准

4.4 Comparison with SOTA models4.4 与 SOTA 模型的比较

4.5 Qualitative Evaluation4,5 定性评价

4.6 Ablation study4,6 消融研究

4.7 Further Analysis4,7 进一步分析