Our proposed TG-MVAD introduces the text modality into video anomaly detection, enhancing explainability while achieving strong performance.
In recent years, weakly supervised multimodal video anomaly detection, which leverages RGB, optical flow, and audio modalities, has garnered significant attention from researchers, emerging as a vital subfield within video anomaly detection. However, previous studies have inadequately explored the role of the text modality in this domain. With the proliferation of large-scale text-annotated video datasets and the advent of video captioning models, obtaining text descriptions from videos has become increasingly feasible. The text modality, which carries explicit semantic information, can more accurately characterize events within videos and identify anomalies, thereby enhancing the model's detection capability and reducing false alarms. To investigate the impact of the text modality on video anomaly detection, we propose a novel text-guided weakly supervised multimodal video anomaly detection framework. Specifically, we introduce an in-context learning based multi-stage text augmentation mechanism that generates high-quality anomaly text samples, counteracting training bias and yielding a text feature extractor better suited for anomaly detection. Additionally, we present a multi-scale bottleneck transformer fusion module to enhance multimodal integration, which uses a small set of bottleneck tokens to progressively transmit compressed information across modalities. Experimental results on the large-scale UCF-Crime and XD-Violence datasets demonstrate that our proposed approach achieves state-of-the-art performance.
We introduce an in-context learning (ICL) based multi-stage text augmentation (MSTA) mechanism that generates additional high-quality abnormal text samples, counteracting the bias introduced when fine-tuning the text feature extractor.
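To make the MSTA idea concrete, here is a minimal sketch of ICL-based augmentation: a few-shot prompt is built from exemplar (caption, rewrite) pairs, a language model generates candidates, and a later stage filters them. The `generate` callable, the prompt wording, and the filtering rule are all hypothetical stand-ins, not the paper's actual pipeline.

```python
def build_icl_prompt(exemplars, seed_caption):
    """Assemble a few-shot in-context prompt from exemplar rewrite pairs.

    exemplars: list of (source caption, augmented caption) pairs.
    """
    lines = ["Rewrite the video caption as a varied, detailed anomaly description."]
    for src, tgt in exemplars:
        lines.append(f"Input: {src}\nOutput: {tgt}")
    lines.append(f"Input: {seed_caption}\nOutput:")
    return "\n".join(lines)


def multi_stage_augment(seed_captions, exemplars, generate):
    """Hypothetical two-stage augmentation sketch.

    generate: any text-generation callable, prompt -> string (e.g. an LLM API).
    """
    # Stage 1: in-context generation of candidate abnormal text samples.
    candidates = [generate(build_icl_prompt(exemplars, c)) for c in seed_captions]
    # Stage 2: lightweight quality filter (a stand-in for the paper's later
    # stages): drop empty outputs and verbatim copies of the seed caption.
    return [c.strip() for c, s in zip(candidates, seed_captions)
            if c and c.strip() and c.strip() != s]
```

In practice the generation stage would call a real LLM, and the filter stage could score candidates with the text encoder itself before they are used for fine-tuning.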
We propose a multi-scale bottleneck transformer (MSBT) module that improves inter-modality integration. It employs a reduced set of bottleneck tokens to progressively convey condensed information between modalities, effectively capturing complex cross-modal dependencies. The proposed MSBT-based text-guided multimodal video anomaly detection (TG-MVAD) framework is shown below.
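The bottleneck-token idea can be sketched as follows: a small shared token set first attends over each modality's tokens (compressing it), and each modality then attends back over the shared tokens (reading the fused summary). This is a minimal single-head, pure-Python illustration of the information flow, not the paper's MSBT architecture; the layer count, residual update, and token sizes are assumptions.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def attend(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def bottleneck_fuse(modalities, bottleneck, layers=2):
    """Exchange information between modalities through a small bottleneck.

    modalities: dict name -> list of token vectors (all the same dimension).
    bottleneck: short list of shared tokens acting as the compressed channel.
    """
    for _ in range(layers):
        # 1) Bottleneck tokens read each modality in turn, so all cross-modal
        #    traffic is squeezed through the few shared tokens.
        for tokens in modalities.values():
            bottleneck = attend(bottleneck, tokens, tokens)
        # 2) Each modality reads the fused bottleneck back (residual update).
        for name, tokens in modalities.items():
            delta = attend(tokens, bottleneck, bottleneck)
            modalities[name] = [[x + u for x, u in zip(tok, upd)]
                                for tok, upd in zip(tokens, delta)]
    return modalities, bottleneck
```

Because every cross-modal interaction passes through the few bottleneck tokens, the cost of fusion grows with the bottleneck size rather than with the product of the modality sequence lengths, which is the usual motivation for this design.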
BibTex Code Here