A Comparison of Methods for Evaluating Generative IR
arXiv (2024)
Abstract
Information retrieval systems increasingly incorporate generative components.
For example, in a retrieval augmented generation (RAG) system, a retrieval
component might provide a source of ground truth, while a generative component
summarizes and augments its responses. In other systems, a large language model
(LLM) might directly generate responses without consulting a retrieval
component. While there are multiple definitions of generative information
retrieval (Gen-IR) systems, in this paper we focus on those systems where the
system's response is not drawn from a fixed collection of documents or
passages. The response to a query may be entirely new text, never seen before. Since
traditional IR evaluation methods break down under this model, we explore
various methods that extend traditional offline evaluation approaches to the
Gen-IR context. Offline IR evaluation traditionally employs paid human
assessors, but LLMs are increasingly replacing human assessment, producing
labels of quality similar or superior to crowdsourced labels. Given that Gen-IR
systems do not generate responses from a fixed set, we assume that methods for
Gen-IR evaluation must largely depend on LLM-generated labels. Along with
methods based on binary and graded relevance, we explore methods based on
explicit subtopics, pairwise preferences, and embeddings. We first validate
these methods against human assessments on several TREC Deep Learning Track
tasks; we then apply these methods to evaluate the output of several purely
generative systems. For each method we consider both its ability to act
autonomously, without the need for human labels or other input, and its ability
to support human auditing. To trust these methods, we must be assured that
their results align with human assessments; to support this, evaluation
criteria must be transparent, so that outcomes can be audited by human
assessors.
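
As a rough illustration of the embedding-based evaluation idea mentioned above, the following sketch scores a generated response by its maximum cosine similarity to a set of reference answers. This is not the paper's method: the use of the sentence-transformers library, the model name, and the embedding_score function are assumptions chosen purely for illustration.

import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed example model; any sentence-embedding model could stand in here.
model = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_score(response: str, references: list[str]) -> float:
    """Cosine similarity between a generated response and its closest reference answer."""
    vectors = model.encode([response] + references)
    resp, refs = vectors[0], vectors[1:]
    sims = refs @ resp / (np.linalg.norm(refs, axis=1) * np.linalg.norm(resp))
    return float(sims.max())

# Hypothetical usage: grade a purely generative answer against human-written references.
print(embedding_score(
    "Paris is the capital of France.",
    ["The capital of France is Paris.", "France's capital city is Paris."],
))

A score near 1.0 suggests the generated text is semantically close to a known-good answer; unlike binary or graded relevance labels, this signal requires no fixed document collection, which is why embedding-based methods are a natural fit for Gen-IR evaluation.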