Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
CoRR (2024)
Abstract
LLM-based auto-annotators have become a key component of the LLM development
process due to their cost-effectiveness and scalability compared to human-based
evaluation. However, these auto-annotators can introduce complex biases that
are hard to remove. Even simple, known confounders such as preference for
longer outputs remain in existing automated evaluation metrics. We propose a
simple regression analysis approach for controlling biases in auto-evaluations.
As a real case study, we focus on reducing the length bias of AlpacaEval, a
fast and affordable benchmark for chat LLMs that uses LLMs to estimate response
quality. Despite being highly correlated with human preferences, AlpacaEval is
known to favor models that generate longer outputs. We introduce a
length-controlled AlpacaEval that aims to answer the counterfactual question:
"What would the preference be if the model's and baseline's outputs had the same
length?" To achieve this, we first fit a generalized linear model (GLM) to predict
the biased output of interest (auto-annotator preferences) based on the
mediators we want to control for (length difference) and other relevant
features. We then obtain length-controlled preferences by predicting
preferences while conditioning the GLM on a zero difference in lengths.
Length-controlling not only improves the robustness of the metric to
manipulations in model verbosity but also increases the Spearman
correlation with LMSYS' Chatbot Arena from 0.94 to 0.98. We release the code
and leaderboard at https://tatsu-lab.github.io/alpaca_eval/ .
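
The two-step procedure described in the abstract (fit a GLM on the length difference, then predict with that difference set to zero) can be illustrated with a minimal sketch. This is not the released AlpacaEval implementation: it assumes a plain logistic regression with a single intercept (standing in for model quality) and one length-difference coefficient, it omits the "other relevant features" the abstract mentions, and it runs on synthetic data. The names `fit_logistic`, `simulate`, `theta`, and `phi` are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(length_diffs, prefs, lr=0.5, steps=500):
    """Fit P(pref = 1) = sigmoid(theta + phi * length_diff).

    Uses batch gradient ascent on the average log-likelihood.
    theta is the length-independent term; phi captures the length bias.
    """
    theta, phi = 0.0, 0.0
    n = len(prefs)
    for _ in range(steps):
        g_theta = 0.0
        g_phi = 0.0
        for d, y in zip(length_diffs, prefs):
            err = y - sigmoid(theta + phi * d)
            g_theta += err
            g_phi += err * d
        theta += lr * g_theta / n
        phi += lr * g_phi / n
    return theta, phi


def simulate(n=2000, true_theta=0.0, true_phi=1.5, seed=0):
    """Synthetic annotator (an assumption for this sketch).

    The model is exactly as good as the baseline (true_theta = 0) but tends
    to write longer outputs, and the annotator rewards length (true_phi > 0).
    """
    rng = random.Random(seed)
    diffs, prefs = [], []
    for _ in range(n):
        d = rng.gauss(1.0, 1.0)  # normalized length difference, model minus baseline
        p = sigmoid(true_theta + true_phi * d)
        diffs.append(d)
        prefs.append(1 if rng.random() < p else 0)
    return diffs, prefs


diffs, prefs = simulate()
theta, phi = fit_logistic(diffs, prefs)

# Raw win rate: the average of the biased annotator preferences.
raw_win_rate = sum(prefs) / len(prefs)

# Length-controlled win rate: predict with the length difference set to zero.
lc_win_rate = sigmoid(theta + phi * 0.0)

print(f"fitted theta={theta:.3f}, phi={phi:.3f}")
print(f"raw win rate: {raw_win_rate:.3f}")
print(f"length-controlled win rate: {lc_win_rate:.3f}")
```

In this synthetic setup the model is no better than the baseline, so the raw win rate is inflated purely by verbosity, while the prediction at zero length difference should land near 50%.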