Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models
arXiv (2024)
Abstract
We introduce Vibe-Eval: a new open benchmark and framework for evaluating
multimodal chat models. Vibe-Eval consists of 269 visual understanding prompts,
including 100 of hard difficulty, complete with gold-standard responses
authored by experts. Vibe-Eval is open-ended and challenging with dual
objectives: (i) vibe checking multimodal chat models for day-to-day tasks and
(ii) rigorously testing and probing the capabilities of present frontier
models. Notably, our hard set contains >50% questions that all frontier models
answer incorrectly. We explore the nuances of designing, evaluating, and
ranking models on ultra-challenging prompts. We also discuss trade-offs between
human and automatic evaluation, and show that automatic model evaluation using
Reka Core correlates roughly with human judgment. We offer free API access for
the purpose of lightweight evaluation and plan to conduct formal human
evaluations for public models that perform well on Vibe-Eval's automatic
scores. We release the evaluation code and data; see
https://github.com/reka-ai/reka-vibe-eval