Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance

Shengyang Sun1, Jiashen Hua2, Junyi Feng2, Xiaojin Gong1*
1College of Information Science & Electronic Engineering, Zhejiang University
2Alibaba Cloud

*Indicates Corresponding Author
Code (Coming Soon) arXiv (Coming Soon)

We propose a novel text-guided weakly supervised multimodal video anomaly detection (TG-MVAD) framework. In detail, we introduce a multi-stage text augmentation (MSTA) mechanism to generate high-quality anomaly text samples, counteract training biases, and obtain a text feature extractor better suited for anomaly detection. Additionally, we present a multi-scale bottleneck transformer (MSBT) fusion module to enhance multimodal integration, utilizing a set of reduced bottleneck tokens to progressively transmit compressed information across modalities.

Demo of Anomalies Detected by TG-MVAD

Our proposed TG-MVAD introduces the text modality for anomaly detection, enhancing explainability while achieving strong performance.

Abstract

In recent years, weakly supervised multimodal video anomaly detection, which leverages RGB, optical flow, and audio modalities, has garnered significant attention from researchers, emerging as a vital subfield within video anomaly detection. However, previous studies have inadequately explored the role of the text modality in this domain. With the proliferation of large-scale text-annotated video datasets and the advent of video captioning models, obtaining text descriptions from videos has become increasingly feasible. The text modality, which carries explicit semantic information, can more accurately characterize events within videos and identify anomalies, thereby enhancing the model's detection capabilities and reducing false alarms. To investigate the impact of the text modality on video anomaly detection, we propose a novel text-guided weakly supervised multimodal video anomaly detection framework. Specifically, we introduce an in-context learning based multi-stage text augmentation mechanism to generate high-quality anomaly text samples, counteract training biases, and obtain a text feature extractor better suited for anomaly detection. Additionally, we present a multi-scale bottleneck transformer fusion module to enhance multimodal integration, utilizing a set of reduced bottleneck tokens to progressively transmit compressed information across modalities. Experimental results on the large-scale UCF-Crime and XD-Violence datasets demonstrate that our proposed approach achieves state-of-the-art performance.

Multi-stage Textual Augmentation (MSTA)

We introduce an in-context learning (ICL) based multi-stage text augmentation (MSTA) mechanism aimed at generating more high-quality anomaly text samples, counteracting the bias introduced when fine-tuning the text feature extractor.
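To make the multi-stage loop concrete, here is a minimal Python sketch of ICL-based text augmentation, assuming a generic LLM callable; the function name `msta_augment`, the prompt wording, and the stage/sample counts are illustrative assumptions, not the authors' exact pipeline.

```python
from typing import Callable, List

def msta_augment(seed_captions: List[str],
                 llm: Callable[[str], str],
                 num_stages: int = 3,
                 per_caption: int = 2) -> List[str]:
    """Grow a pool of anomaly text samples over several ICL stages."""
    pool = list(seed_captions)
    for _ in range(num_stages):
        new_samples = []
        for caption in pool:
            # A few in-context examples condition the model on the
            # style and content of anomaly descriptions.
            examples = "\n".join(f"- {c}" for c in pool[:5])
            prompt = (
                "The following captions describe anomalous events:\n"
                f"{examples}\n"
                f"Write {per_caption} new captions describing an anomaly "
                f"similar to: '{caption}'"
            )
            new_samples.append(llm(prompt))
        # Each stage builds on the previous pool, so later stages
        # diversify away from the original seed captions.
        pool.extend(new_samples)
    return pool

# Usage (with any LLM wrapper that returns a string):
# augmented = msta_augment(["A man sets fire to a parked car."], my_llm)
```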

MSBT-based Multimodal Video Anomaly Detection

We propose a multi-scale bottleneck transformer (MSBT) module that improves inter-modality integration. This module employs a reduced set of bottleneck tokens to progressively convey condensed information between modalities, effectively capturing complex cross-modal dependencies. The proposed MSBT-based text-guided multimodal video anomaly detection (TG-MVAD) framework is shown below.
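As a concrete illustration, below is a minimal PyTorch sketch of fusing two modality token sequences through a small set of learnable bottleneck tokens. The class name, dimensions, and single-scale setup are assumptions for illustration; the full MSBT stacks such layers so that compressed information is transmitted progressively across modalities and scales.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    """Fuse two modality sequences via a small set of bottleneck tokens."""
    def __init__(self, dim=256, num_tokens=4, num_heads=4):
        super().__init__()
        # Learnable bottleneck tokens carry compressed cross-modal info.
        self.bottleneck = nn.Parameter(torch.randn(1, num_tokens, dim))
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, src, dst):
        # src, dst: (batch, seq_len, dim) token sequences of two modalities.
        b = src.size(0)
        tokens = self.bottleneck.expand(b, -1, -1)
        # 1) Bottleneck tokens attend to the source modality (compression).
        tokens, _ = self.read(tokens, src, src)
        # 2) Destination tokens attend to the bottleneck (transmission).
        fused, _ = self.write(dst, tokens, tokens)
        return dst + fused  # residual keeps the original modality signal


# Usage: fuse RGB snippet features into text features (shapes illustrative).
rgb, text = torch.randn(2, 32, 256), torch.randn(2, 32, 256)
out = BottleneckFusion()(rgb, text)  # (2, 32, 256)
```

Restricting cross-modal attention to a handful of bottleneck tokens keeps fusion cost low while forcing each modality to transmit only its most salient information.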

BibTeX

BibTeX (Coming Soon)