Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering

Hung-Ting Su, Yulei Niu, Xudong Lin, Winston H. Hsu, Shih-Fu Chang

CoRR (2023)

Abstract

Causal Video Question Answering (CVidQA) queries not only association or temporal relations but also causal relations in a video. Existing question synthesis methods pre-train question generation (QG) systems on reading comprehension datasets with text descriptions as inputs. However, such QG models only learn to ask association questions (e.g., "what is someone doing...") and perform poorly because association knowledge transfers poorly to CVidQA, which focuses on causal questions like "why is someone doing ...". Observing this, we propose to exploit causal knowledge to generate question-answer pairs and introduce a novel framework, Causal Knowledge Extraction from Language Models (CaKE-LM), which leverages causal commonsense knowledge from language models to tackle CVidQA. To extract knowledge from LMs, CaKE-LM generates causal questions containing two events, one triggering the other (e.g., "score a goal" triggers "soccer player kicking ball"), by prompting the LM with the action (soccer player kicking ball) to retrieve the intention (to score a goal). CaKE-LM significantly outperforms conventional methods by 4% to 6% in zero-shot CVidQA accuracy on the NExT-QA and Causal-VidQA datasets. We also conduct comprehensive analyses and provide key findings for future research.
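The action-to-intention prompting step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt template, the `QAPair` type, the helper names, and the canned `toy_lm` stand-in for a real language model are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QAPair:
    question: str
    answer: str


def build_prompt(action: str) -> str:
    # Hypothetical template: ask the LM for the intention behind an action.
    return f"Action: {action}. The intention of this action is"


def make_causal_qa(action: str, lm: Callable[[str], str]) -> QAPair:
    # Query the LM with the action and treat its completion as the intention.
    intention = lm(build_prompt(action)).strip().rstrip(".")
    # Pair a "why" question about the action with the retrieved intention.
    return QAPair(question=f"Why is {action}?", answer=intention)


def toy_lm(prompt: str) -> str:
    # Stand-in for a real language model: returns a fixed completion.
    return " to score a goal."


pair = make_causal_qa("soccer player kicking ball", toy_lm)
print(pair.question)
print(pair.answer)
```

In this sketch the language model is passed in as a plain callable, so any text-completion backend could be substituted for `toy_lm` without changing the question-construction logic.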

Keywords

association knowledge, association questions, CaKE-LM, causal commonsense knowledge, Causal Knowledge Extraction, Causal Knowledge extractors, causal questions, Causal Video Question Answering, Causal-VidQA datasets, language models, QG models, question synthesis methods pretrained question generation systems, question-answer pairs, zero-shot CVidQA accuracy, zero-shot Video Question
