Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM
arXiv (2024)
Abstract
LLMs have become the go-to choice for code generation tasks, with an
exponential increase in the training, development, and usage of LLMs
specifically for code generation. To evaluate the ability of LLMs on code, both
academic and industry practitioners rely on popular handcrafted benchmarks.
However, prior benchmarks contain only a very limited set of problems, both in
quantity and variety. Further, due to popularity and age, many benchmarks are
prone to data leakage where example solutions can be readily found on the web
and thus potentially in training data. Such limitations inevitably lead us to
inquire: Is the leaderboard performance on existing benchmarks reliable and
comprehensive enough to measure the program synthesis ability of LLMs? To
address this, we introduce EvoEval – a program synthesis benchmark suite
created by evolving existing benchmarks into different targeted domains for a
comprehensive evaluation of LLM coding abilities. Our study on 51 LLMs shows
that compared to the high performance obtained on standard benchmarks like
HumanEval, there is a significant drop in performance (on average 39.4%) when
using EvoEval. Additionally, the decrease in performance can range from 19.6%
to 47.7%, leading to drastic ranking changes amongst LLMs and showing potential
overfitting of existing benchmarks. Furthermore, we showcase various insights,
including the brittleness of instruction-following models when encountering
rewording or subtle changes as well as the importance of learning problem
composition and decomposition. EvoEval not only provides comprehensive
benchmarks, but can be used to further evolve arbitrary problems to keep up
with advances and the ever-changing landscape of LLMs for code. We have
open-sourced our benchmarks, tools, and complete LLM generations at
https://github.com/evo-eval/evoeval
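
The abstract describes evolving existing benchmark problems into new targeted domains via an LLM. Below is a minimal, hedged sketch of what such an evolution step could look like; it is not the authors' implementation (see the repository above for that). The prompt wording, the "creative" domain, and the seed problem are illustrative assumptions, and the sketch assumes the OpenAI Python client (v1+) with an API key in the environment.

```python
# Sketch: transform a seed benchmark problem into a harder, domain-shifted
# variant by prompting an LLM, in the spirit of EvoEval's benchmark evolution.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative seed problem (HumanEval-style signature + docstring).
SEED_PROBLEM = '''
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer than the threshold."""
'''

# Illustrative evolution instruction; the real EvoEval prompts differ per domain.
EVOLVE_INSTRUCTION = (
    "Rewrite the following programming problem as a more difficult variant in "
    "the 'creative' domain: change the scenario, add one extra constraint, and "
    "keep a self-contained function signature. Return only the new problem as "
    "a Python signature plus docstring.\n\n"
)

def evolve_problem(seed: str, model: str = "gpt-4o") -> str:
    """Ask the LLM to evolve a seed problem into a new targeted-domain problem."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": EVOLVE_INSTRUCTION + seed}],
        temperature=0.8,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(evolve_problem(SEED_PROBLEM))
```

In practice such a pipeline would also need to generate reference solutions and tests for each evolved problem before it can be used for evaluation; this sketch only covers the problem-rewriting step.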