IR Evaluation: Modeling User Behavior for Measuring Effectiveness

This half-day tutorial on IR evaluation combines an introduction to classical IR evaluation methods with material on more recent user-oriented approaches. We primarily focus on off-line evaluation, but some material on on-line evaluation is also covered. The broad goal of the tutorial is to equip researchers with an understanding of modern approaches to IR evaluation, facilitating new research on this topic and improving evaluation methodology for emerging areas.

[1]  Tetsuya Sakai,et al.  Summaries, ranked retrieval and sessions: a unified framework for information access evaluation , 2013, SIGIR.

[2]  K. Sparck Jones,et al.  INFORMATION RETRIEVAL TEST COLLECTIONS , 1976 .

[3]  Jean Tague-Sutcliffe,et al.  The Pragmatics of Information Retrieval Experimentation Revisited , 1997, Inf. Process. Manag..

[4]  Charles L. A. Clarke,et al.  Reliable information retrieval evaluation with incomplete and biased judgements , 2007, SIGIR.

[5]  Ben Carterette,et al.  System effectiveness, user models, and user utility: a conceptual framework for investigation , 2011, SIGIR.

[6]  Milad Shokouhi,et al.  Expected browsing utility for web search evaluation , 2010, CIKM.

[7]  Alistair Moffat,et al.  Statistical power in retrieval experimentation , 2008, CIKM '08.

[8]  Alistair Moffat,et al.  Click-based evidence for decaying weight distributions in search effectiveness metrics , 2010, Information Retrieval.

[9]  Tague-SutcliffeJean The pragmatics of information retrieval experimentation, revisited , 1992 .

[10]  Stephen E. Robertson,et al.  Extending average precision to graded relevance judgments , 2010, SIGIR.

[11]  José Luis Vicedo González,et al.  TREC: Experiment and evaluation in information retrieval , 2007, J. Assoc. Inf. Sci. Technol..

[12]  Benjamin Piwowarski,et al.  Web Search Engine Evaluation Using Clickthrough Data and a User Model , 2007 .

[13]  Mark Sanderson,et al.  Forming test collections with no system pooling , 2004, SIGIR '04.

[14]  Olivier Chapelle,et al.  Expected reciprocal rank for graded relevance , 2009, CIKM.

[15]  Cyril Cleverdon,et al.  The Cranfield tests on index language devices , 1997 .

[16]  Emine Yilmaz,et al.  The maximum entropy method for analyzing retrieval measures , 2005, SIGIR '05.

[17]  Mounia Lalmas,et al.  Absence time and user engagement: evaluating ranking functions , 2013, WSDM '13.

[18]  Ben Carterette,et al.  Evaluating multi-query sessions , 2011, SIGIR.

[19]  Filip Radlinski,et al.  Optimized interleaving for online retrieval evaluation , 2013, WSDM.

[20]  Andrew Trotman,et al.  Sound and complete relevance assessment for XML retrieval , 2008, TOIS.

[21]  Yiqun Liu,et al.  Automatic search engine performance evaluation with click-through data analysis , 2007, WWW '07.

[22]  Charles L. A. Clarke,et al.  Time-based calibration of effectiveness measures , 2012, SIGIR '12.

[23]  Charles L. A. Clarke,et al.  Efficient construction of large test collections , 1998, SIGIR '98.

[24]  Gabriella Kazai,et al.  User intent and assessor disagreement in web search evaluation , 2013, CIKM.

[25]  Mark Sanderson,et al.  The relationship between IR effectiveness measures and user satisfaction , 2007, SIGIR.

[26]  Ben Carterette,et al.  Incorporating variability in user behavior into systems based evaluation , 2012, CIKM.

[27]  Filip Radlinski,et al.  Relevance and Effort: An Analysis of Document Utility , 2014, CIKM.

[28]  Emine Yilmaz,et al.  A simple and efficient sampling method for estimating AP and NDCG , 2008, SIGIR '08.

[29]  James Allan,et al.  A comparison of statistical significance tests for information retrieval evaluation , 2007, CIKM '07.

[30]  Stephen E. Robertson,et al.  On GMAP: and other transformations , 2006, CIKM '06.

[31]  Stephen E. Robertson,et al.  A new interpretation of average precision , 2008, SIGIR '08.

[32]  Charles L. A. Clarke,et al.  Novelty and diversity in information retrieval evaluation , 2008, SIGIR '08.

[33]  Mark Sanderson,et al.  Information retrieval system evaluation: effort, sensitivity, and reliability , 2005, SIGIR '05.

[34]  Ben Carterette,et al.  Robust test collections for retrieval evaluation , 2007, SIGIR.

[35]  Ellen M. Voorhees,et al.  TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing) , 2005 .

[36]  Alistair Moffat,et al.  Rank-biased precision for measurement of retrieval effectiveness , 2008, TOIS.

[37]  Alistair Moffat,et al.  Strategic system comparisons via targeted relevance judgments , 2007, SIGIR.

[38]  Mark Sanderson,et al.  Do user preferences and evaluation measures line up? , 2010, SIGIR.

[39]  Thorsten Joachims,et al.  Accurately interpreting clickthrough data as implicit feedback , 2005, SIGIR '05.

[40]  James Allan,et al.  Minimal test collections for retrieval evaluation , 2006, SIGIR.

[41]  Charles L. A. Clarke,et al.  Stochastic simulation of time-biased gain , 2012, CIKM '12.

[42]  Gabriella Kazai,et al.  eXtended cumulated gain measures for the evaluation of content-oriented XML retrieval , 2006, TOIS.

[43]  Charles L. A. Clarke,et al.  A comparative analysis of cascade measures for novelty and diversity , 2011, WSDM '11.

[44]  Emine Yilmaz,et al.  Estimating average precision when judgments are incomplete , 2007, Knowledge and Information Systems.

[45]  Ian Soboroff,et al.  Ranking retrieval systems without relevance judgments , 2001, SIGIR '01.

[46]  Charles L. A. Clarke,et al.  Modeling user variance in time-biased gain , 2012, HCIR '12.

[47]  Ellen M. Voorhees,et al.  Variations in relevance judgments and the measurement of retrieval effectiveness , 1998, SIGIR '98.

[48]  Cassidy R. Sugimoto,et al.  A systematic review of interactive information retrieval evaluation studies, 1967-2006 , 2013, J. Assoc. Inf. Sci. Technol..

[49]  Gabriella Kazai,et al.  Relevance dimensions in preference-based IR evaluation , 2013, SIGIR.

[50]  Tetsuya Sakai,et al.  Evaluating evaluation metrics based on the bootstrap , 2006, SIGIR.

[51]  Jaana Kekäläinen,et al.  Cumulated gain-based evaluation of IR techniques , 2002, TOIS.

[52]  Emine Yilmaz,et al.  A statistical method for system evaluation using incomplete judgments , 2006, SIGIR.

[53]  Ben Carterette,et al.  Simulating simple user behavior for system effectiveness evaluation , 2011, CIKM '11.