cwl_eval: An Evaluation Tool for Information Retrieval

We present a tool ("cwl_eval") that unifies many metrics typically used to evaluate information retrieval systems using test collections. In the CWL framework, metrics are specified via a single function, which can be used to derive a number of related measurements: Expected Utility per item, Expected Total Utility, Expected Cost per item, Expected Total Cost, and Expected Depth. The CWL framework brings together several independent approaches to measuring the quality of a ranked list, and provides a coherent, user-model-based framework for developing measures based on utility (gain) and cost. Here we outline the CWL measurement framework, describe the cwl_eval architecture, and provide examples of how to use it. We provide implementations of a number of recent metrics, including Time-Biased Gain, U-Measure, the Bejewelled Measure, and the Information Foraging Based Measure, as well as earlier metrics such as Precision, Average Precision, Discounted Cumulative Gain, Rank-Biased Precision, and INST. By providing state-of-the-art and traditional metrics within the same framework, we promote a standardised approach to evaluating search effectiveness.
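
The measurements listed above all follow from a single user-model function: the probability that a user who has just examined the item at rank i continues on to rank i+1 (the "continuation" in C/W/L). The sketch below is a minimal illustration of those derivations only; it is not the cwl_eval API, and the function names, the truncation depth, and the choice of a constant continuation probability (which yields Rank-Biased Precision) are assumptions made purely for this example.

    # Illustrative sketch of the C/W/L derivations described above.
    # NOTE: not the cwl_eval API; names and the truncation depth are assumed
    # for this example only.

    def cwl_measurements(continuation, gains, costs=None, depth=1000):
        """Derive the C/W/L measurements from a continuation function C(i).

        continuation(i) -- probability of moving from rank i to rank i+1 (1-based).
        gains           -- per-rank gain values g(i); padded with zeros to `depth`.
        costs           -- per-rank cost values; defaults to a cost of 1 per item.
        """
        gains = list(gains) + [0.0] * (depth - len(gains))
        costs = list(costs) + [1.0] * (depth - len(costs)) if costs else [1.0] * depth

        # Probability that rank i is examined: product of earlier continuations.
        examine = [1.0]
        for i in range(1, depth):
            examine.append(examine[-1] * continuation(i))

        expected_depth = sum(examine)                    # expected number of items examined
        weights = [e / expected_depth for e in examine]  # W(i): share of attention at rank i

        return {
            "EU":  sum(w * g for w, g in zip(weights, gains)),  # Expected Utility per item
            "ETU": sum(e * g for e, g in zip(examine, gains)),  # Expected Total Utility
            "EC":  sum(w * c for w, c in zip(weights, costs)),  # Expected Cost per item
            "ETC": sum(e * c for e, c in zip(examine, costs)),  # Expected Total Cost
            "ED":  expected_depth,                              # Expected Depth
        }

    if __name__ == "__main__":
        # A constant continuation probability C(i) = p gives RBP-style behaviour:
        # EU equals Rank-Biased Precision with persistence p.
        p = 0.8
        hypothetical_gains = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
        print(cwl_measurements(lambda i: p, hypothetical_gains))

Swapping in a different continuation function, for example one that decays as gain accumulates, is what produces the adaptive metrics named above; cwl_eval packages such definitions behind a common interface.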
