Multi-entity Video Transformers for Fine-Grained Video Representation Learning.
CoRR (2023)
Abstract
The area of temporally fine-grained video representation learning aims to
generate frame-by-frame representations for temporally dense tasks. In this
work, we advance the state-of-the-art for this area by re-examining the design
of transformer architectures for video representation learning. A salient
aspect of our self-supervised method is the improved integration of spatial
information in the temporal pipeline by representing multiple entities per
frame. Prior works use late-fusion architectures that reduce each frame to a
single one-dimensional vector before any cross-frame information is shared,
whereas our method represents each frame as a group of entities or tokens. Our Multi-entity
Video Transformer (MV-Former) architecture achieves state-of-the-art results on
multiple fine-grained video benchmarks. MV-Former leverages image features from
self-supervised ViTs, and employs several strategies to maximize the utility of
the extracted features while also avoiding the need to fine-tune the complex
ViT backbone. These strategies include Learnable Spatial Token Pooling, which
identifies and extracts features for multiple salient regions per
frame. Our experiments show that MV-Former not only outperforms previous
self-supervised methods, but also surpasses some prior works that use
additional supervision or training data. When combined with additional
pre-training data from Kinetics-400, MV-Former achieves a further performance
boost. The code for MV-Former is available at
https://github.com/facebookresearch/video_rep_learning.
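To make the design concrete, below is a minimal PyTorch sketch of the two ideas the abstract highlights: pooling a frozen ViT's patch tokens into multiple entity tokens per frame (a stand-in for Learnable Spatial Token Pooling), and running a temporal transformer over all entity tokens rather than over one collapsed vector per frame. All class names, shapes, and hyperparameters here are illustrative assumptions, not the paper's actual implementation; the authors' code is in the linked repository.

```python
import torch
import torch.nn as nn

class LearnableSpatialTokenPooling(nn.Module):
    """Pool a frame's ViT patch tokens into K entity tokens via learned
    attention maps (illustrative; the paper's exact formulation may differ)."""
    def __init__(self, dim, num_entities):
        super().__init__()
        # One learned query per entity; each query attends over patch tokens.
        self.queries = nn.Parameter(torch.randn(num_entities, dim) * 0.02)
        self.scale = dim ** -0.5

    def forward(self, patch_tokens):
        # patch_tokens: (B*T, N, D) frozen ViT features for each frame
        attn = torch.einsum('kd,bnd->bkn', self.queries, patch_tokens) * self.scale
        attn = attn.softmax(dim=-1)                # (B*T, K, N) spatial maps
        return torch.einsum('bkn,bnd->bkd', attn, patch_tokens)  # (B*T, K, D)

class MultiEntityVideoTransformer(nn.Module):
    """Frozen ViT features -> K entity tokens per frame -> temporal transformer."""
    def __init__(self, dim=768, num_entities=4, depth=4, heads=8, max_frames=64):
        super().__init__()
        self.pool = LearnableSpatialTokenPooling(dim, num_entities)
        self.num_entities = num_entities
        self.temporal_pos = nn.Parameter(torch.zeros(1, max_frames, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, dim)

    def forward(self, vit_tokens):
        # vit_tokens: (B, T, N, D) patch features from a frozen self-supervised ViT
        B, T, N, D = vit_tokens.shape
        entities = self.pool(vit_tokens.flatten(0, 1))        # (B*T, K, D)
        entities = entities.view(B, T, self.num_entities, D)
        entities = entities + self.temporal_pos[:, :T]
        # Every entity token from every frame attends jointly, so spatial
        # detail survives into the cross-frame stage instead of being
        # collapsed to one vector per frame before fusion.
        x = self.temporal(entities.flatten(1, 2))             # (B, T*K, D)
        x = x.view(B, T, self.num_entities, D).mean(dim=2)    # per-frame embedding
        return self.head(x)                                   # (B, T, D)

# Example usage with hypothetical ViT-B/16 patch features (196 tokens/frame):
model = MultiEntityVideoTransformer(dim=768, num_entities=4)
feats = torch.randn(2, 16, 196, 768)   # batch of 2 clips, 16 frames each
out = model(feats)                      # (2, 16, 768) frame-level embeddings
```

Averaging the entity tokens back into a per-frame embedding at the end is one of several plausible readout choices for frame-level benchmarks; the key point the sketch illustrates is that the collapse happens after, not before, cross-frame attention.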