Large-scale validation and analysis of interleaved search evaluation

Interleaving is an increasingly popular technique for evaluating information retrieval systems based on implicit user feedback. While a number of isolated studies have analyzed how this technique agrees with conventional offline evaluation approaches and other online techniques, a complete picture of its efficiency and effectiveness is still lacking. In this paper we extend and consolidate the body of empirical evidence regarding interleaving, providing a comprehensive analysis based on data from two major commercial search engines and a retrieval system for scientific literature. In particular, we analyze the agreement of interleaving with manual relevance judgments and observational implicit feedback measures, estimate the statistical efficiency of interleaving, and explore the relative performance of different interleaving variants. We also show how to learn improved credit-assignment functions for clicks that further increase the sensitivity of interleaving.
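
For readers unfamiliar with the mechanics, below is a minimal Python sketch of team-draft interleaving, one of the interleaving variants compared in the paper. The function names and the simple per-impression credit rule are illustrative assumptions made for this sketch, not the authors' implementation.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length=10, rng=random):
    """Interleave two rankings using team-draft interleaving (sketch).

    Each round, a coin flip decides which ranker drafts first; each ranker
    then contributes its highest-ranked document not yet shown. The team
    assignment of every shown document is recorded so that clicks can later
    be credited to the ranker that contributed the clicked result.
    """
    interleaved = []   # combined result list shown to the user
    teams = {}         # doc -> "A" or "B", used for click credit assignment

    def draft(ranking, label):
        # Add this ranker's highest-ranked document not already shown.
        for doc in ranking:
            if doc not in teams:
                interleaved.append(doc)
                teams[doc] = label
                return True
        return False   # this ranker has no new documents left

    while len(interleaved) < length:
        order = [(ranking_a, "A"), (ranking_b, "B")]
        if rng.random() < 0.5:
            order.reverse()
        progressed = False
        for ranking, label in order:
            if len(interleaved) < length:
                progressed = draft(ranking, label) or progressed
        if not progressed:
            break      # both rankings are exhausted
    return interleaved, teams


def credit_from_clicks(clicked_docs, teams):
    """Score one impression: +1 if ranker A's documents receive more clicks,
    -1 if ranker B's do, 0 for a tie."""
    wins_a = sum(1 for d in clicked_docs if teams.get(d) == "A")
    wins_b = sum(1 for d in clicked_docs if teams.get(d) == "B")
    return (wins_a > wins_b) - (wins_a < wins_b)


if __name__ == "__main__":
    a = ["d1", "d2", "d3", "d4"]
    b = ["d3", "d5", "d1", "d6"]
    shown, teams = team_draft_interleave(a, b, length=6)
    print(shown, teams, credit_from_clicks(["d3"], teams))
```

Aggregating the per-impression outcomes of credit_from_clicks over many queries yields the pairwise preference between the two rankers that an interleaving experiment measures; the credit-assignment functions studied in the paper refine this simple equal-weight click rule.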
