Click model-based information retrieval metrics

In recent years many models have been proposed for predicting the clicks of web search users. In addition, several information retrieval evaluation metrics have been built on top of a user model. In this paper we bring these two directions together and propose a common approach for converting any click model into an evaluation metric. We then place the resulting model-based metrics, together with traditional metrics such as DCG and Precision, into a common evaluation framework and compare them along a number of dimensions. One dimension we are particularly interested in is the agreement between offline and online experimental outcomes. It is widely believed, especially in industrial settings, that online A/B testing and interleaving experiments capture system quality better than offline measurements. We show that offline metrics based on click models correlate more strongly with online experimental outcomes than traditional offline metrics, especially when relevance judgements are incomplete.
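As a concrete illustration of the general idea, the sketch below shows how a cascade-style click model yields an offline metric: the click model supplies the probability that the user is satisfied and stops at each rank, and the metric is the expected utility of that stopping rank. Expected Reciprocal Rank (ERR) is the best-known instance of this construction; the grade-to-probability mapping and the maximum grade used here are illustrative assumptions, not the paper's exact parameterisation.

```python
# Minimal sketch (not the authors' implementation) of a click-model-based metric.
# Under a cascade-style model, the probability that the user stops at rank r is
# R_r * prod_{i<r}(1 - R_i), where R_i is derived from the relevance grade.
# ERR uses 1/r as the utility of stopping at rank r.

def attractiveness(grade: int, max_grade: int = 4) -> float:
    """Map a graded relevance label to a satisfaction probability (ERR-style mapping)."""
    return (2 ** grade - 1) / (2 ** max_grade)

def err(grades: list[int], max_grade: int = 4) -> float:
    """Expected Reciprocal Rank for a ranked list of relevance grades."""
    metric = 0.0
    p_reach = 1.0  # probability the user examines the current rank
    for rank, grade in enumerate(grades, start=1):
        r = attractiveness(grade, max_grade)
        metric += p_reach * r * (1.0 / rank)  # expected utility of stopping here
        p_reach *= (1.0 - r)                  # user continues only if not yet satisfied
    return metric

# Example: a five-document ranking with graded judgements 0..4 (higher ERR is better).
print(err([4, 0, 2, 1, 0]))
```

Swapping in a different click model (e.g. one with learned examination and satisfaction parameters) changes only how the stopping probabilities are computed; the expected-utility construction stays the same.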
