Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval.

ICMR '18: International Conference on Multimedia Retrieval, Yokohama, Japan, June 2018

Cited by 272 | Viewed 503
Abstract
Constructing a joint representation invariant across different modalities (e.g., video, language) is of significant importance in many multimedia applications. While a number of recent works have developed effective image-text retrieval methods by learning joint representations, the video-text retrieval task has not been explored to its fullest extent. In this paper, we study how to effectively utilize available multimodal cues from videos for the cross-modal video-text retrieval task. Based on our analysis, we propose a novel framework that simultaneously utilizes multimodal features (different visual characteristics, audio inputs, and text) through a fusion strategy for efficient retrieval. Furthermore, we explore several loss functions for training the embedding and propose a modified pairwise ranking loss for the task. Experiments on the MSVD and MSR-VTT datasets demonstrate that our method achieves significant performance gains compared to state-of-the-art approaches.
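The abstract does not spell out the exact form of the proposed modified pairwise ranking loss, so the following is only a minimal sketch of a common variant: a bidirectional max-margin ranking loss over a batch of paired video/text embeddings, with optional hard-negative mining (VSE++-style). The function name, margin value, and embedding dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ranking_loss(video_emb, text_emb, margin=0.2, hard_negative=True):
    """Illustrative bidirectional max-margin ranking loss on paired
    video/text embeddings of shape [batch, dim] (not the paper's exact loss)."""
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    sim = v @ t.t()                        # [batch, batch] cosine similarities
    pos = sim.diag().view(-1, 1)           # similarity of each matching pair

    # Hinge costs: text retrieval from a video anchor, and vice versa.
    cost_v2t = (margin + sim - pos).clamp(min=0)       # rows index video anchors
    cost_t2v = (margin + sim - pos.t()).clamp(min=0)   # columns index text anchors

    # Positive pairs on the diagonal are not negatives, so mask them out.
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_v2t = cost_v2t.masked_fill(eye, 0)
    cost_t2v = cost_t2v.masked_fill(eye, 0)

    if hard_negative:
        # Keep only the hardest negative per anchor (one common modification).
        return cost_v2t.max(dim=1)[0].mean() + cost_t2v.max(dim=0)[0].mean()
    return cost_v2t.mean() + cost_t2v.mean()

# Toy usage: 8 paired video/text embeddings of dimension 512.
loss = ranking_loss(torch.randn(8, 512), torch.randn(8, 512))
```

In practice the video embedding fed to such a loss would come from the fused multimodal features (visual, audio, text) described in the abstract, and the text embedding from a sentence encoder projected into the same joint space.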
Keywords
Video-Text Retrieval, Joint Embedding, Multimodal Cues