MovieChat+: Question-aware Sparse Memory for Long Video Question Answering
arXiv (2024)
Abstract
Recently, integrating video foundation models with large language models to build video understanding systems has emerged as a way to overcome the limitations of specific pre-defined vision tasks. Yet existing methods either employ complex spatial-temporal modules or rely heavily on additional perception models to extract temporal features, and they perform well only on short videos. For long videos, the computational complexity and memory costs associated with long-term temporal connections increase significantly, posing additional challenges. Taking advantage of the Atkinson-Shiffrin memory model, with tokens in Transformers employed as the carriers of memory in combination with our specially designed memory mechanism, we propose MovieChat to overcome these challenges. We lift pre-trained multi-modal large language models to understand long videos without incorporating additional trainable temporal modules, using a zero-shot approach. MovieChat achieves state-of-the-art performance in long video understanding. We also release the MovieChat-1K benchmark, with 1K long videos, 2K temporal grounding labels, and 14K manual annotations, to validate the effectiveness of our method. The code and the dataset are available at https://github.com/rese1f/MovieChat.
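
The "specially designed memory mechanism" consolidates a buffer of short-term frame tokens into a compact long-term memory by repeatedly merging the most similar adjacent tokens. The following is a minimal, illustrative sketch of such a greedy adjacent-merge consolidation, not the paper's exact implementation; the function name, tensor shapes, and the plain-averaging merge rule are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def consolidate_memory(frames: torch.Tensor, target_len: int) -> torch.Tensor:
    """Greedily merge the most similar adjacent frame tokens until only
    `target_len` tokens remain, compressing a short-term buffer into a
    compact long-term memory.

    frames: (num_frames, dim) tensor of per-frame visual tokens.
    """
    tokens = list(frames)
    while len(tokens) > target_len:
        # Cosine similarity between each pair of adjacent tokens.
        sims = [F.cosine_similarity(tokens[i], tokens[i + 1], dim=0)
                for i in range(len(tokens) - 1)]
        i = int(torch.stack(sims).argmax())
        # Merge the most similar adjacent pair by averaging (an assumed
        # merge rule for this sketch).
        merged = (tokens[i] + tokens[i + 1]) / 2
        tokens[i:i + 2] = [merged]
    return torch.stack(tokens)

# Example: compress a 16-frame short-term buffer into 4 long-term tokens.
short_term = torch.randn(16, 768)
long_term = consolidate_memory(short_term, target_len=4)
print(long_term.shape)  # torch.Size([4, 768])
```

Because only adjacent tokens are merged, temporal order is preserved while redundant near-duplicate frames, which dominate long videos, collapse into single tokens, keeping memory cost roughly constant regardless of video length.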