Learning Temporal Dynamics in Videos With Image Transformer

IEEE Transactions on Multimedia(2024)

引用 0|浏览6
暂无评分
摘要
Temporal dynamics represent the evolving of video content over time, which are critical for action recognition. In this paper, we ask the question: can the off-the-shelf image transformer architecture learn temporal dynamics in videos? To this end, we propose Multidimensional Stacked Image (MSImage) as a new arrangement of video data, which can be fed to image transformers. Technically, MSImage is a high-resolution image that is composed of several evenly-sampled video clips stacked along the channel and space dimensions. The frames in each clip are concatenated along the channel dimension for the transformers to infer short-term dynamics. Meanwhile, the clips are arranged on different spatial positions for learning long-term dynamics. On this basis we propose MSImageFormer-a new variant of image transformer that takes MSImage as the input and is jointly optimized by video classification loss and a new dynamics enhancement loss. The network optimization attends to the high-frequency component of MSImage, avoiding overfitting to static visual patterns. We empirically demonstrate the merits of the MSImageFormer on six action recognition benchmarks. With only 2D image transformer as the classifier, our MSImageFormer achieves 85.3% and 69.7% top-1 accuracy on Kinetics-400 and Something-Something V2 datasets, respectively. Despite with fewer computations, the results are comparable to the SOTA 3D CNNs and video transformers.
更多
查看译文
关键词
video action recognition,neural networks,vision transformer
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要