Transformer vision-language tracking via proxy token guided cross-modal fusion

Pattern Recognition Letters (2023)

Abstract
Tracking by vision-language is an emergent topic. Previous researchers mainly adopt CNNs and sequential models for video and language encoding; however, their methods are limited by poor generalization performance. To address this problem, this paper presents a novel vision-language tracking framework based on the Transformer. Specifically, our proposed framework contains an image encoder, a language encoder, a cross-modal fusion module, and task-specific heads. We adopt a residual network and BERT for image and language embedding, respectively. More importantly, we propose a proxy token guided cross-modal fusion module based on the Transformer network, which links the vision and language features effectively and efficiently. The proxy token acts as a proxy for the word embeddings and interacts with the visual feature. After absorbing vision information, the proxy token is used to modulate the word embeddings and make them attend to the visual feature. Finally, we obtain the organically fused features via a dynamic modal aggregation method and feed them into the task-specific heads for tracking. Extensive experiments demonstrate that our method sets a new state of the art on multiple language-assisted tracking datasets, including OTB-LANG, LaSOT, TNL2K, and the newly proposed Ref-LTB50, which is annotated with dense language specifications. The source code of this paper will be made publicly available. (c) 2023 Elsevier B.V. All rights reserved.
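The abstract outlines a three-step interaction: the proxy token absorbs visual context, then modulates the word embeddings, which in turn attend to the visual features, with a dynamic aggregation producing the fused representation. The following PyTorch snippet is a minimal sketch of that flow under our own assumptions; the class name ProxyTokenFusion, the use of standard multi-head attention layers, the gating-based aggregation, and all dimensions are illustrative and not taken from the authors' implementation.

```python
# Minimal sketch of proxy-token-guided cross-modal fusion (illustrative only).
import torch
import torch.nn as nn


class ProxyTokenFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # Learnable proxy token standing in for the word embeddings.
        self.proxy = nn.Parameter(torch.zeros(1, 1, dim))
        # Step 1: proxy token attends to visual features (absorbs vision info).
        self.vision_to_proxy = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Step 2: vision-aware proxy modulates the word embeddings.
        self.proxy_to_words = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Step 3: modulated word embeddings attend to the visual features.
        self.words_to_vision = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Dynamic modal aggregation, assumed here to be a learned sigmoid gate.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis_feat, word_emb):
        # vis_feat: (B, N_v, dim) flattened visual tokens; word_emb: (B, N_w, dim).
        B = vis_feat.size(0)
        proxy = self.proxy.expand(B, -1, -1)

        # Proxy token gathers visual context.
        proxy, _ = self.vision_to_proxy(proxy, vis_feat, vis_feat)
        # Vision-aware proxy modulates the word embeddings.
        word_mod, _ = self.proxy_to_words(word_emb, proxy, proxy)
        # Modulated word embeddings attend to the visual features.
        lang_ctx, _ = self.words_to_vision(word_mod, vis_feat, vis_feat)

        # Aggregate language context and gate it against each visual token.
        lang_vec = lang_ctx.mean(dim=1, keepdim=True).expand(-1, vis_feat.size(1), -1)
        g = self.gate(torch.cat([vis_feat, lang_vec], dim=-1))
        fused = g * vis_feat + (1 - g) * lang_vec
        return fused  # (B, N_v, dim), to be fed to task-specific tracking heads


if __name__ == "__main__":
    fusion = ProxyTokenFusion(dim=256)
    vis = torch.randn(2, 400, 256)   # e.g. a 20x20 search-region feature map
    txt = torch.randn(2, 12, 256)    # e.g. 12 projected BERT word embeddings
    print(fusion(vis, txt).shape)    # torch.Size([2, 400, 256])
```

The sigmoid gate is one plausible reading of "dynamic modal aggregation": each fused token is a per-channel convex combination of the visual token and the language context, so the network can weight the two modalities frame by frame.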
Keywords
Visual object tracking, Transformer, Vision-language