A Benchmark for Controllable Text-Image-to-Video Generation

IEEE TRANSACTIONS ON MULTIMEDIA (2024)

Abstract
Automatic video generation is a challenging research topic that attracts interest from several perspectives, including Image-to-Video generation (I2V), Video-to-Video generation (V2V), and Text-to-Video generation (T2V). To pursue more controllable and fine-grained video generation, a novel video generation task, named Text-Image-to-Video generation (TI2V), and a corresponding baseline solution, named Motion Anchor-based video Generator (MAGE), were proposed. However, two other factors, namely clean datasets and reliable evaluation metrics, also play important roles in the success of the TI2V task. In this article, we present a complete benchmark for the TI2V task that includes synthetic video-text paired datasets, a baseline method, and two evaluation metrics. More specifically: (1) Two versions of synthetic datasets are built on CATER, containing rich combinations of objects and actions as well as the resulting changes in brightness and shadow. We also provide both explicit and ambiguous text descriptions to support deterministic and diverse video generation, respectively. (2) A refined version of MAGE, dubbed MAGE+, is proposed with an innovative motion anchor structure that stores appearance-motion-aligned representations, into which explicit conditions and implicit randomness can be injected to model the uncertainty in the data distribution. (3) To evaluate the quality of generated videos, especially given ambiguous descriptions, we introduce action precision and referring expression precision, which assess motion quality via a captioning-and-matching method. Experiments conducted on the proposed datasets, as well as related datasets, verify the effectiveness of our baseline and demonstrate the appealing potential of the TI2V task.
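For intuition, the captioning-and-matching idea behind the action precision metric can be sketched as: caption each generated video, extract the performed action from the caption, and compare it with the action named in the conditioning text. The sketch below is a minimal illustration, not the paper's actual pipeline; the caption and parse_action components are hypothetical stand-ins for a trained video captioner and parser (the action vocabulary shown is CATER's).

from typing import Callable, List

def action_precision(
    videos: List[dict],
    texts: List[str],
    caption: Callable[[dict], str],
    parse_action: Callable[[str], str],
) -> float:
    """Share of generated videos whose captioned action matches
    the action stated in the conditioning text."""
    hits = sum(
        parse_action(caption(v)) == parse_action(t)
        for v, t in zip(videos, texts)
    )
    return hits / max(len(videos), 1)

# Toy usage with stand-in components; a real evaluation would use a
# trained video captioner over CATER's atomic actions.
ACTIONS = ("slide", "rotate", "pick-place", "contain")

def toy_caption(video: dict) -> str:      # pretend captioner
    return video["caption"]

def toy_parse(sentence: str) -> str:      # pick first known action word
    return next((a for a in ACTIONS if a in sentence), "")

videos = [{"caption": "the cone slides to the right"}]
texts = ["the cone slides toward the camera"]
print(action_precision(videos, texts, toy_caption, toy_parse))  # -> 1.0

Referring expression precision would follow the same matching scheme, but on the parsed object description (e.g., "the small gold cone") rather than the action.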
Keywords
Video generation, text-image-to-video, multimodal-conditioned generation