The twist measure for IR evaluation: Taking user's effort into account

We present a novel measure for ranking evaluation, called Twist (τ). It is a measure for informational intents that handles both binary and graded relevance. τ stems from the observation that searching is now taken for granted: users naturally assume that search engines are available and work well. As a consequence, users may also take for granted the utility they gain from finding relevant documents, which is the focus of traditional measures. Conversely, they may feel uneasy when the system returns nonrelevant documents, because they are then forced to do additional, avoidable work to get the desired information. This latter aspect is the focus of τ, which evaluates the effectiveness of a system from the point of view of the effort required of users to retrieve the desired information. We provide a formal definition of τ and a demonstration of its properties, and we introduce the notion of effort/gain plots, which complement traditional utility-based measures. By means of an extensive experimental evaluation, τ is shown to capture different aspects of system performance, to not require extensive and costly assessments, and to be a robust tool for detecting differences between systems.
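As a rough illustration of the effort/gain perspective (not the paper's formal definition of τ, which is given in the full text), the sketch below computes, for a ranked result list with binary relevance judgments, the cumulative gain and the cumulative wasted effort (nonrelevant documents inspected) at each rank; plotting one against the other yields a simple effort/gain curve. The function name and the restriction to binary relevance are assumptions made for illustration only.

```python
from typing import List, Tuple

def effort_gain_points(relevance: List[int]) -> List[Tuple[int, int]]:
    """Illustrative sketch (not the paper's formal tau): for a ranked list of
    binary relevance judgments (1 = relevant, 0 = nonrelevant), return one
    (effort, gain) point per rank, where 'gain' counts relevant documents seen
    so far and 'effort' counts nonrelevant documents the user had to inspect."""
    points = []
    gain = 0      # relevant documents encountered so far (the user's utility)
    effort = 0    # nonrelevant documents encountered so far (avoidable work)
    for rel in relevance:
        if rel > 0:
            gain += 1
        else:
            effort += 1
        points.append((effort, gain))
    return points

if __name__ == "__main__":
    # A run that front-loads relevant documents accumulates gain with little
    # effort; a poorer ranking forces effort before any gain is obtained.
    good_run = [1, 1, 0, 1, 0]
    poor_run = [0, 0, 1, 0, 1]
    print(effort_gain_points(good_run))  # [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
    print(effort_gain_points(poor_run))  # [(1, 0), (2, 0), (2, 1), (3, 1), (3, 2)]
```

Under these assumptions, two systems with the same final gain can still differ sharply in the effort they impose along the way, which is the kind of difference an effort-oriented measure such as τ is designed to surface.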
