Practical Online Retrieval Evaluation

Online evaluation assesses information retrieval (IR) techniques by how real users respond to them. Because it rests directly on observed user behavior, it is a promising alternative to traditional offline evaluation, which relies on manual relevance assessments. In particular, online evaluation enables comparisons in settings where reliable assessments are difficult to obtain (e.g., personalized search) or expensive (e.g., search by trained experts in specialized collections). Despite these advantages, and its successful use in commercial settings, online evaluation is rarely employed outside of large commercial search engines because it is perceived as impractical at small scales. The goal of this tutorial is to show how online evaluations can be conducted in such settings, to demonstrate software that facilitates their use, and to promote further research in the area. We will also contrast online evaluation with standard offline evaluation, and provide an overview of online approaches.
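
To make the idea of an online comparison concrete, the sketch below shows one widely used form of it, team-draft interleaving: the rankings of two systems for the same query are merged into a single result list, each position is tagged with the system that contributed it, and observed clicks are credited back to that system. This is a minimal, purely illustrative Python sketch; the function names, document IDs, and click positions are assumptions for exposition and are not the software demonstrated in the tutorial.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length=10):
    """Merge two rankings with team-draft interleaving.

    Returns the interleaved list and, per position, the label of the
    system ("A" or "B") that contributed the result shown there.
    """
    a, b = list(ranking_a), list(ranking_b)   # work on copies
    interleaved, team, seen = [], [], set()
    while len(interleaved) < length and (a or b):
        # Each round, both systems contribute one result, in random order.
        for label, ranking in random.sample([("A", a), ("B", b)], 2):
            # Skip documents that have already been placed in the list.
            while ranking and ranking[0] in seen:
                ranking.pop(0)
            if ranking and len(interleaved) < length:
                doc = ranking.pop(0)
                interleaved.append(doc)
                team.append(label)
                seen.add(doc)
    return interleaved, team


def credit_clicks(team, clicked_positions):
    """Credit each click to the system that contributed the clicked result."""
    wins = {"A": 0, "B": 0}
    for pos in clicked_positions:
        wins[team[pos]] += 1
    return wins


# Hypothetical usage: two systems' rankings for one query, one observed click.
shown, team = team_draft_interleave(["d1", "d2", "d3"], ["d3", "d4", "d1"], length=4)
print(shown, team, credit_clicks(team, clicked_positions=[0]))
```

Aggregated over many query impressions, such per-impression win counts provide the kind of preference signal that online evaluation uses in place of manual relevance assessments.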
