Evaluation Measures Based on Preference Graphs

Research and Development in Information Retrieval（2021）

引用 8|浏览87

暂无评分

摘要

ABSTRACTThe offline evaluation of search requires us to define a standard against which we measure the quality of results returned by a ranker. Frequently this standard is defined in absolute terms through relevance grades, but it can also be defined in relative terms through preferences. These preferences might be created through explicit preference judgments, derived from relevance grades, or inferred from clicks and other signals. Preferences from multiple sources might even be combined. In contrast to absolute grades, preferences avoid complex definitions of relevance, indicating only that a ranker should favor one result over another. Despite the simplicity and flexibility of preferences, widespread adoption has been limited by the lack of established evaluation measures. Recent work in this direction has taken two approaches: 1) measures based on weighted counts of agreements and disagreements between a set of preferences and an actual ranking generated by a ranker; and 2) measures that translate preferences into gain values for use with traditional measures, such as nDCG. Both approaches require methods for specifying weights or gains that have little or no theoretical foundation, and the values of these measures have no clear and meaningful interpretation. To address these problems, we propose an evaluation measure that computes the similarity between a directed multigraph of preferences and an actual ranking generated by a ranker. The measure computes an ordering for the vertices of the preference graph that maximizes its similarity to the actual ranking under a rank similarity measure. This maximum similarity becomes the value of the measure. Preference graphs are often acyclic, or nearly so, and to compute the measure we extend an approximate greedy algorithm that is known to produce good results for nearly acyclic graphs. For the rank similarity measure we employ Rank Biased Overlap (RBO) which was explicitly created to match the requirements of search and related applications. We validate the new measure over several collections of preferences explored in recent work.

查看译文

关键词

offline evaluation, preferences, feedback arc set

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要