Eyes Can Deceive: Benchmarking Counterfactual Reasoning Abilities of Multi-modal Large Language Models
arXiv (2024)
Abstract
Counterfactual reasoning, as a crucial manifestation of human intelligence,
refers to making presuppositions based on established facts and extrapolating
potential outcomes. Existing multimodal large language models (MLLMs) have
exhibited impressive cognitive and reasoning capabilities, which have been
examined across a wide range of Visual Question Answering (VQA) benchmarks.
Nevertheless, how will existing MLLMs perform when faced with counterfactual
questions? To answer this question, we first curate a novel
CounterFactual MultiModal reasoning
benchmark, abbreviated as CFMM, to systematically assess the
counterfactual reasoning capabilities of MLLMs. Our CFMM comprises six
challenging tasks, each containing hundreds of carefully human-labeled
counterfactual questions, to evaluate MLLMs' counterfactual reasoning
capabilities across diverse aspects. Interestingly, our experiments show
that existing MLLMs prefer to believe what they see and ignore the
counterfactual presuppositions presented in the questions, leading to
inaccurate responses. Furthermore, we evaluate a wide range of prevalent MLLMs
on our proposed CFMM. The significant gap between their performance on our CFMM
and that on several VQA benchmarks indicates that there is still considerable
room for improvement in existing MLLMs toward approaching human-level
intelligence. Conversely, future efforts to boost MLLMs' performance on our
CFMM may open promising avenues toward developing MLLMs with advanced
intelligence.