Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

Vivek Rathod, Bryan Seybold, Sudheendra Vijayanarasimhan, Austin Myers, Xiuye Gu, Vighnesh Birodkar, David A. Ross

arXiv (2023)

Abstract

Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Although these co-embeddings are trained on static images rather than videos, we show that they enable open-vocabulary performance competitive with fully supervised models. We show that performance can be further improved by ensembling the image-text features with features encoding local motion, such as optical-flow-based features, or with other modalities, such as audio. In addition, we propose a more reasonable open-vocabulary evaluation setting for the ActivityNet dataset, in which the category splits are based on similarity rather than random assignment.
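
As a rough illustration of the idea in the abstract, the sketch below scores per-frame embeddings against class-name text embeddings by cosine similarity, optionally fuses two score streams, and groups consecutive high-scoring frames into temporal segments. It is a minimal sketch under stated assumptions, not the authors' implementation: the embeddings are toy vectors standing in for the output of a pretrained image-text model, and the function names, the weighted-average fusion rule, and the fixed threshold are all hypothetical choices that the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_frames(frame_embs, text_embs):
    """Score every frame against every class-name embedding.

    frame_embs: list of vectors, one per video frame (assumed precomputed).
    text_embs: dict mapping class name -> vector (assumed precomputed).
    """
    return [{name: cosine(f, t) for name, t in text_embs.items()}
            for f in frame_embs]


def ensemble(scores_a, scores_b, weight=0.5):
    """Fuse two per-frame score tables by weighted average (assumed rule)."""
    return [{k: weight * a[k] + (1 - weight) * b[k] for k in a}
            for a, b in zip(scores_a, scores_b)]


def segments(scores, label, threshold):
    """Group consecutive frames scoring >= threshold for `label`.

    Returns (start, end) frame-index pairs with `end` exclusive.
    """
    out, start = [], None
    for i, s in enumerate(scores):
        if s[label] >= threshold:
            if start is None:
                start = i
        elif start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(scores)))
    return out


# Toy example: 2-D vectors stand in for real image and text embeddings.
text_embs = {"run": [1.0, 0.0], "swim": [0.0, 1.0]}
frame_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]
scores = score_frames(frame_embs, text_embs)
run_segments = segments(scores, "run", threshold=0.5)
```

Because the class names enter only through their text embeddings, new categories can be queried at test time by adding entries to `text_embs`, which is what makes this kind of scoring open-vocabulary.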

Keywords

action, detection, features, open-vocabulary, off-the-shelf, image-text
