Foreground/Background-Masked Interaction Learning for Spatio-temporal Action Detection

MM '23: Proceedings of the 31st ACM International Conference on Multimedia(2023)

引用 0|浏览9
暂无评分
摘要
Spatio-temporal Action Detection (SAD) aims to recognize the multi-class actions, and meanwhile locate their spatio-temporal occurrence in untrimmed videos. Besides relying on the inherent inter-actor interactions, most previous SAD approaches model actor interactions between multi-actors and the whole frames or special parts (e.g., objects/hands). However, such approaches are relatively graceless by 1) roughly treating all various actors to equivalently interact with frames/parts or by 2) sumptuously borrowing multiple costly detectors to acquire the special parts. To solve the above dilemma, we propose a novel Foreground/Background-masked Interaction Learning (dubbed as FBI Learning) framework to learn the multi-actor features by attentively interacting with the hands-down foreground and background frames. Specifically, we first design a new Mask-guided Cross Attention (MCA) mechanism that calculates the masked cross-attentions to capture the compact relations between the actors and foreground/background regions. Next, we present a new Actor-guided Feature Aggregation (AFA) scheme that integrates foreground- and background-interacted actor features with the learnable actor-based weights. Finally, we construct a long-term feature bank that associates temporal context information to facilitate action classification. Extensive experiments are conducted on commonly available UCF101-24, MultiSports, and AVA v2.1/v2.2 datasets, which illustrate the competitive performance of FBI Learning against the state-of-the-art methods.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要