Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval
CoRR (2024)
Abstract
Video Moment Retrieval (VMR) requires precise modelling of fine-grained
moment-text associations to capture intricate visual-language relationships.
Due to the lack of a diverse and generalisable VMR dataset to facilitate
learning scalable moment-text associations, existing methods resort to joint
training on both source and target domain videos for cross-domain applications.
Meanwhile, recent vision-language multimodal models pre-trained on large-scale
image-text and/or video-text pairs are based only on coarse, weakly labelled
associations. They cannot provide the fine-grained moment-text correlations
required for cross-domain VMR. In this work, we solve
the problem of unseen cross-domain VMR, where certain visual and textual
concepts do not overlap across domains, by only utilising target domain
sentences (text prompts) without accessing their videos. To that end, we
explore generative video diffusion for fine-grained editing of source videos
controlled by the target sentences, enabling us to simulate target domain
videos. We address two problems in video editing for optimising unseen domain
VMR: (1) generation of high-quality simulation videos of different moments with
subtle distinctions, (2) selection of simulation videos that complement
existing source training videos without introducing harmful noise or
unnecessary repetitions. For the first problem, we formulate a two-stage video
diffusion generation process controlled simultaneously by (1) the original
structure of a source video, (2) subject specifics, and (3) a target sentence
prompt. This ensures fine-grained variations between video moments. For the
second problem, we introduce a hybrid selection mechanism that combines two
quantitative metrics for noise filtering with one qualitative metric that
leverages VMR predictions to select simulation videos.