InfiCoder-Eval: Systematically Evaluating the Question-Answering Capabilities of Code Large Language Models
arXiv (2024)
Abstract
Large Language Models for understanding and generating code (code LLMs) have
witnessed tremendous progress in recent years. With the rapid development of
code LLMs, many popular evaluation benchmarks, such as HumanEval, DS-1000, and
MBPP, have emerged to measure the performance of code LLMs with a particular
focus on code generation tasks. However, they are insufficient to cover the
full range of expected capabilities of code LLMs, which extend beyond code
generation to answering diverse coding-related questions. To fill this gap, we
propose InfiCoder-Eval, a large-scale freeform question-answering (QA)
benchmark for code, comprising 234 carefully selected high-quality Stack
Overflow questions spanning 15 programming languages. To evaluate response
correctness, InfiCoder-Eval supports four types of model-free metrics, with
domain experts carefully choosing and concretizing the criteria for each
question. We conduct a systematic evaluation of more than 80 code LLMs on
InfiCoder-Eval, leading to a series of insightful findings. Furthermore, our
detailed analyses showcase possible directions for further improvement of code
LLMs. InfiCoder-Eval is fully open source at
https://infi-coder.github.io/inficoder-eval/ and is continuously maintained and
expanded to foster more scientific and systematic practices for evaluating
code LLMs.
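
To make the notion of a model-free metric more concrete, below is a minimal sketch of a keyword-matching scorer for freeform answers. The function name, weighting scheme, and example criteria are hypothetical illustrations, not the actual InfiCoder-Eval implementation; the benchmark's real metrics and per-question criteria are defined in its open-source repository.

```python
import re

def keyword_match_score(response: str, keyword_weights: dict[str, float]) -> float:
    """Score a freeform answer by the weighted fraction of expected keywords it contains.

    keyword_weights maps an expected keyword (e.g. a flag, API name, or concept)
    to its weight in the final score. Matching is case-insensitive and whole-token.
    This is an illustrative sketch, not InfiCoder-Eval's own scorer.
    """
    total = sum(keyword_weights.values())
    if total == 0:
        return 0.0
    hit = 0.0
    for keyword, weight in keyword_weights.items():
        # Escape the keyword so regex metacharacters in code identifiers stay literal,
        # and require non-word boundaries so "set" does not match inside "reset".
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        if re.search(pattern, response, flags=re.IGNORECASE):
            hit += weight
    return hit / total

# Example usage with made-up grading criteria for a Git question.
answer = "Run `git reset --soft HEAD~1` to undo the last commit but keep changes staged."
criteria = {"git reset": 0.5, "--soft": 0.3, "HEAD~1": 0.2}
print(keyword_match_score(answer, criteria))  # 1.0
```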