Improving Interpretable Embeddings for Ad-hoc Video Search with Generative Captions and Multi-word Concept Bank
arXiv (2024)
Abstract
Aligning a user query and video clips in a cross-modal latent space, and aligning them with semantic concepts, are two mainstream approaches for ad-hoc video search (AVS). However, the effectiveness of existing approaches is bottlenecked by the small sizes of available video-text datasets and the low quality of concept banks, which leads to failures on unseen queries and to the out-of-vocabulary problem. This paper addresses these two problems by
constructing a new dataset and developing a multi-word concept bank.
Specifically, capitalizing on a generative model, we construct a new dataset
consisting of 7 million generated text and video pairs for pre-training. To
tackle the out-of-vocabulary problem, we develop a multi-word concept bank
based on syntax analysis to enhance the capability of a state-of-the-art
interpretable AVS method in modeling relationships between query words. We also
study the impact of current advanced features on the method. Experimental results show that the integration of the proposed elements doubles the R@1 performance of the AVS method on the MSRVTT dataset and improves the xinfAP on the TRECVid AVS query sets for 2016-2023 (eight years) by a margin from 2% to 77%.
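
To make the concept-bank idea concrete, below is a minimal sketch of extracting multi-word concepts from a query via syntax analysis. It assumes spaCy's dependency parser and a simple noun-chunk plus verb-object heuristic; the function name and the heuristic are illustrative assumptions, not the paper's actual construction.

import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_multiword_concepts(query: str) -> list[str]:
    """Collect noun phrases and verb-object pairs as candidate multi-word concepts."""
    doc = nlp(query)
    # Noun chunks capture object concepts such as "a red car".
    concepts = [chunk.text.lower() for chunk in doc.noun_chunks]
    # Pair each verb with its direct object to capture action concepts.
    for token in doc:
        if token.dep_ == "dobj" and token.head.pos_ == "VERB":
            concepts.append(f"{token.head.lemma_} {token.text.lower()}")
    return concepts

print(extract_multiword_concepts("a person riding a horse on the beach"))
# e.g. ['a person', 'a horse', 'the beach', 'ride horse']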