Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance

Shengyang Sun1, Jiashen Hua2, Junyi Feng2, Xiaojin Gong1*
1College of Information Science & Electronic Engineering, Zhejiang University
2Alibaba Cloud

*Indicates Corresponding Author
Code (Coming Soon) arXiv (Coming Soon)

We propose a novel text-guided weakly supervised multimodal video anomaly detection (TG-MVAD) framework. In detail, we introduce a multi-stage text augmentation (MSTA) mechanism to generate high-quality anomaly text samples, counteract training biases, and obtain a text feature extractor better suited for anomaly detection. Additionally, we present a multi-scale bottleneck transformer (MSBT) fusion module to enhance multimodal integration, utilizing a set of reduced bottleneck tokens to progressively transmit compressed information across modalities.

Demo of Anomalies Detected by TG-MVAD

Our proposed TG-MVAD introduces the text modality for anomaly detection, enhancing explainability while achieving strong performance.

Abstract

In recent years, weakly supervised multimodal video anomaly detection, which leverages RGB, optical flow, and audio modalities, has garnered significant attention from researchers, emerging as a vital subfield within video anomaly detection. However, previous studies have inadequately explored the role of the text modality in this domain. With the proliferation of large-scale text-annotated video datasets and the advent of video captioning models, obtaining text descriptions from videos has become increasingly feasible. The text modality, which carries explicit semantic information, can more accurately characterize events within videos and identify anomalies, thereby enhancing the model's detection capabilities and reducing false alarms. To investigate the impact of the text modality on video anomaly detection, we propose a novel text-guided weakly supervised multimodal video anomaly detection framework. Specifically, we introduce an in-context learning based multi-stage text augmentation mechanism to generate high-quality anomaly text samples, counteract training biases, and obtain a text feature extractor better suited for anomaly detection. Additionally, we present a multi-scale bottleneck transformer fusion module to enhance multimodal integration, utilizing a set of reduced bottleneck tokens to progressively transmit compressed information across modalities. Experimental results on the large-scale UCF-Crime and XD-Violence datasets demonstrate that our proposed approach achieves state-of-the-art performance.

Multi-stage Textual Augmentation (MSTA)

We introduce an in-context learning (ICL) based multi-stage text augmentation (MSTA) mechanism aimed at generating more high-quality anomaly text samples, counteracting the bias introduced when fine-tuning the text feature extractor.
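To make the multi-stage loop concrete, here is a minimal Python sketch of ICL-based text augmentation, assuming a generic LLM callable; the function name `msta_augment`, the prompt wording, and the stage/sample counts are illustrative assumptions, not the authors' exact pipeline.

```python
from typing import Callable, List

def msta_augment(seed_captions: List[str],
                 llm: Callable[[str], str],
                 num_stages: int = 3,
                 per_caption: int = 2) -> List[str]:
    """Grow a pool of anomaly text samples over several ICL stages."""
    pool = list(seed_captions)
    for _ in range(num_stages):
        new_samples = []
        for caption in pool:
            # A few in-context examples condition the model on the
            # style and content of anomaly descriptions.
            examples = "\n".join(f"- {c}" for c in pool[:5])
            prompt = (
                "The following captions describe anomalous events:\n"
                f"{examples}\n"
                f"Write {per_caption} new captions describing an anomaly "
                f"similar to: '{caption}'"
            )
            new_samples.append(llm(prompt))
        # Each stage builds on the previous pool, so later stages
        # diversify away from the original seed captions.
        pool.extend(new_samples)
    return pool

# Usage (with any LLM wrapper that returns a string):
# augmented = msta_augment(["A man sets fire to a parked car."], my_llm)
```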

MSBT-based Multimodal Video Anomaly Detection

We propose a multi-scale bottleneck transformer (MSBT) module that improves inter-modality integration. This module employs a reduced set of bottleneck tokens to progressively convey condensed information between modalities, effectively capturing complex cross-modal dependencies. The proposed MSBT-based text-guided multimodal video anomaly detection (TG-MVAD) framework is shown below.
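As a concrete illustration, below is a minimal PyTorch sketch of fusing two modality token sequences through a small set of learnable bottleneck tokens. The class name, dimensions, and single-scale setup are assumptions for illustration; the full MSBT stacks such layers so that compressed information is transmitted progressively across modalities and scales.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    """Fuse two modality sequences via a small set of bottleneck tokens."""
    def __init__(self, dim=256, num_tokens=4, num_heads=4):
        super().__init__()
        # Learnable bottleneck tokens carry compressed cross-modal info.
        self.bottleneck = nn.Parameter(torch.randn(1, num_tokens, dim))
        self.read = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.write = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, src, dst):
        # src, dst: (batch, seq_len, dim) token sequences of two modalities.
        b = src.size(0)
        tokens = self.bottleneck.expand(b, -1, -1)
        # 1) Bottleneck tokens attend to the source modality (compression).
        tokens, _ = self.read(tokens, src, src)
        # 2) Destination tokens attend to the bottleneck (transmission).
        fused, _ = self.write(dst, tokens, tokens)
        return dst + fused  # residual keeps the original modality signal


# Usage: fuse RGB snippet features into text features (shapes illustrative).
rgb, text = torch.randn(2, 32, 256), torch.randn(2, 32, 256)
out = BottleneckFusion()(rgb, text)  # (2, 32, 256)
```

Restricting cross-modal attention to a handful of bottleneck tokens keeps fusion cost low while forcing each modality to transmit only its most salient information.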

BibTeX

BibTeX (Coming Soon)