cwl_eval: An Evaluation Tool for Information Retrieval

We present a tool ("cwl_eval") that unifies many metrics typically used to evaluate information retrieval systems using test collections. In the CWL framework, metrics are specified via a single function, which can be used to derive a number of related measurements: Expected Utility per item, Expected Total Utility, Expected Cost per item, Expected Total Cost, and Expected Depth. The CWL framework brings together several independent approaches to measuring the quality of a ranked list, and provides a coherent, user-model-based framework for developing measures based on utility (gain) and cost. Here we outline the CWL measurement framework, describe the cwl_eval architecture, and provide examples of how to use it. We provide implementations of a number of recent metrics, including Time-Biased Gain, U-Measure, the Bejewelled Measure, and the Information Foraging Based Measure, as well as earlier metrics such as Precision, Average Precision, Discounted Cumulative Gain, Rank-Biased Precision, and INST. By providing state-of-the-art and traditional metrics within the same framework, we promote a standardised approach to evaluating search effectiveness.
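
The measurements listed above all follow from a single user-model function: the probability that a user who has just examined the item at rank i continues on to rank i+1 (the "continuation" in C/W/L). The sketch below is a minimal illustration of those derivations only; it is not the cwl_eval API, and the function names, the truncation depth, and the choice of a constant continuation probability (which yields Rank-Biased Precision) are assumptions made purely for this example.

    # Illustrative sketch of the C/W/L derivations described above.
    # NOTE: not the cwl_eval API; names and the truncation depth are assumed
    # for this example only.

    def cwl_measurements(continuation, gains, costs=None, depth=1000):
        """Derive the C/W/L measurements from a continuation function C(i).

        continuation(i) -- probability of moving from rank i to rank i+1 (1-based).
        gains           -- per-rank gain values g(i); padded with zeros to `depth`.
        costs           -- per-rank cost values; defaults to a cost of 1 per item.
        """
        gains = list(gains) + [0.0] * (depth - len(gains))
        costs = list(costs) + [1.0] * (depth - len(costs)) if costs else [1.0] * depth

        # Probability that rank i is examined: product of earlier continuations.
        examine = [1.0]
        for i in range(1, depth):
            examine.append(examine[-1] * continuation(i))

        expected_depth = sum(examine)                    # expected number of items examined
        weights = [e / expected_depth for e in examine]  # W(i): share of attention at rank i

        return {
            "EU":  sum(w * g for w, g in zip(weights, gains)),  # Expected Utility per item
            "ETU": sum(e * g for e, g in zip(examine, gains)),  # Expected Total Utility
            "EC":  sum(w * c for w, c in zip(weights, costs)),  # Expected Cost per item
            "ETC": sum(e * c for e, c in zip(examine, costs)),  # Expected Total Cost
            "ED":  expected_depth,                              # Expected Depth
        }

    if __name__ == "__main__":
        # A constant continuation probability C(i) = p gives RBP-style behaviour:
        # EU equals Rank-Biased Precision with persistence p.
        p = 0.8
        hypothetical_gains = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
        print(cwl_measurements(lambda i: p, hypothetical_gains))

Swapping in a different continuation function, for example one that decays as gain accumulates, is what produces the adaptive metrics named above; cwl_eval packages such definitions behind a common interface.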
