TMA-Net: a transformer-based multi-modal attention network for abnormal behavior detection

International Journal of Artificial Intelligence

Abstract

Abnormal behavior detection in crowded environments remains challenging due to complex motion patterns, occlusions, and domain variability. This paper presents the transformer-based multi-modal attention network (TMA-Net), a unified framework that integrates red, green, and blue (RGB), optical flow (OF), and heat map (HM) modalities through a dual-stage attention fusion mechanism. The system employs you only look once version 11 (YOLOv11) for human localization and vision transformer (ViT)-B/16 for feature encoding, followed by intra-modal self-attention and cross-modal fusion to capture fine-grained spatial–temporal and motion-energy dependencies. Extensive experiments on seven public benchmarks — UMN, Crowd-11, UBNormal, ShanghaiTech, CUHK Avenue, UCSD Ped2, and EPUAbN — demonstrate that TMA-Net achieves up to 97.5% area under the curve (AUC) and 96–100% accuracy, outperforming prior state-of-the-art approaches. These results highlight the framework’s strong generalization and robustness across both single- and cross-dataset evaluations, underscoring its potential for reliable deployment in real-world intelligent surveillance systems.
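The abstract does not specify the exact fusion architecture, but the two stages it names — intra-modal self-attention per modality, then cross-modal fusion across modalities — can be illustrated with a minimal NumPy sketch. The token counts, feature dimension, and the choice of RGB as the query modality are illustrative assumptions, and the single-head attention below omits the learned projections a real ViT-based model would use.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # stage 1: single-head scaled dot-product self-attention within one
    # modality (identity Q/K/V projections for brevity)
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores) @ tokens

def cross_modal_fusion(query_tokens, context_tokens):
    # stage 2: tokens of one modality attend over another modality's tokens
    d = query_tokens.shape[-1]
    scores = query_tokens @ context_tokens.T / np.sqrt(d)
    return softmax(scores) @ context_tokens

rng = np.random.default_rng(0)
rgb  = rng.standard_normal((4, 8))  # 4 tokens, dim 8: stand-ins for ViT features
flow = rng.standard_normal((4, 8))  # optical-flow modality
hm   = rng.standard_normal((4, 8))  # heat-map modality

# intra-modal self-attention per modality
rgb_a, flow_a, hm_a = (self_attention(m) for m in (rgb, flow, hm))

# cross-modal fusion: RGB queries attend to the motion/heat-map context
fused = cross_modal_fusion(rgb_a, np.concatenate([flow_a, hm_a], axis=0))
print(fused.shape)  # (4, 8)
```

The fused tokens keep the query modality's shape, so a downstream classification head can consume them exactly as it would the original RGB features.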
