VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, Qing Li

CoRR (2024)

Abstract
We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, employs tools including video segment localization and object memory querying, along with other visual foundation models, to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, with an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-sourced models and private counterparts including Gemini 1.5 Pro.
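
To make the two memory components and two of the tools named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. All class, field, and method names (`TemporalEvent`, `ObjectTrack`, `VideoMemory`, `localize_segments`, `query_objects`) are hypothetical, the sample data is invented for illustration, and simple word overlap stands in as an assumption for whatever similarity measure the actual system uses for segment localization.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TemporalEvent:
    """Generic description of one video segment (hypothetical schema)."""
    segment_id: int
    start_s: float
    end_s: float
    caption: str


@dataclass
class ObjectTrack:
    """Object-centric tracking state (hypothetical schema)."""
    object_id: int
    category: str
    segment_ids: set[int] = field(default_factory=set)


@dataclass
class VideoMemory:
    """Structured memory holding both event descriptions and object tracks."""
    events: list[TemporalEvent] = field(default_factory=list)
    objects: list[ObjectTrack] = field(default_factory=list)

    def localize_segments(self, query: str, top_k: int = 1) -> list[TemporalEvent]:
        # Assumption: word overlap is a toy stand-in for a learned
        # text/video similarity score.
        query_words = set(query.lower().split())

        def overlap(event: TemporalEvent) -> int:
            return len(query_words & set(event.caption.lower().split()))

        return sorted(self.events, key=overlap, reverse=True)[:top_k]

    def query_objects(self, category: str) -> list[ObjectTrack]:
        # Return every tracked object of the requested category.
        return [track for track in self.objects if track.category == category]


# Invented example data, for illustration only.
memory = VideoMemory(
    events=[
        TemporalEvent(0, 0.0, 2.0, "a person opens the fridge"),
        TemporalEvent(1, 2.0, 4.0, "a person pours milk into a cup"),
    ],
    objects=[
        ObjectTrack(0, "cup", {1}),
        ObjectTrack(1, "person", {0, 1}),
    ],
)

best_segment = memory.localize_segments("who pours the milk")[0]
cups = memory.query_objects("cup")
```

In an agent of the kind the abstract describes, an LLM would decide which of these tools to call for a given query and combine their outputs; that control loop is omitted here.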