Predicting Search Satisfaction Metrics with Interleaved Comparisons

The gold standard for online retrieval evaluation is AB testing. Rooted in the idea of a controlled experiment, AB tests compare the performance of an experimental system (treatment) on one sample of the user population to that of a baseline system (control) on another sample. Given an online evaluation metric that accurately reflects user satisfaction, these tests enjoy high validity. However, because variance across users is high, these comparisons often have low sensitivity, requiring millions of queries to detect statistically significant differences between systems. Interleaving is an alternative online evaluation approach in which each user is presented with a combination of results from both the control and treatment systems. Compared to AB tests, interleaving has been shown to be substantially more sensitive. However, interleaving methods have so far focused on user clicks only, and lack support for the more sophisticated user satisfaction metrics used in AB testing. In this paper we present the first method for integrating user satisfaction metrics with interleaving. We show how interleaving can be extended to (1) directly match the user signals and parameters of AB metrics, and (2) automatically calibrate parameterized interleaving credit functions to predict AB outcomes. We also develop a new method for estimating the relative sensitivity of interleaving and AB metrics, and show that our interleaving credit functions improve agreement with AB metrics without sacrificing sensitivity. Our results, using 38 large-scale online experiments encompassing over 3 billion clicks in a web search setting, demonstrate up to a 22% improvement in agreement with AB metrics (constituting over a 50% error reduction), while maintaining sensitivity one to two orders of magnitude higher than AB tests. This paves the way towards more sensitive and accurate online evaluation.
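
To make the idea of a parameterized interleaving credit function concrete, the sketch below shows one minimal, hypothetical instantiation rather than the paper's exact formulation: each click earns credit from a base click weight plus a bonus when its dwell time crosses a satisfaction threshold, and the per-query interleaving outcome is the sign of the credit difference between the two rankers' contributions. The choice of signal (dwell time), the threshold, and the weights are illustrative assumptions; in the approach described above, such parameters would be calibrated automatically so that the aggregated interleaving outcome predicts the AB metric.

```python
# Minimal sketch of a parameterized interleaving credit function
# (illustrative assumptions throughout; not the authors' exact method).

from dataclasses import dataclass
from typing import List


@dataclass
class Click:
    team: str             # "A" or "B": which ranker contributed the clicked result
    dwell_seconds: float  # time spent on the landing page after the click


def click_credit(click: Click, sat_dwell: float = 30.0,
                 w_click: float = 1.0, w_sat: float = 2.0) -> float:
    """Credit for a single click: a base weight for the click itself plus a
    bonus when dwell time exceeds a satisfaction threshold. The parameters
    sat_dwell, w_click and w_sat are hypothetical and would be tuned by
    calibration against AB outcomes."""
    credit = w_click
    if click.dwell_seconds >= sat_dwell:
        credit += w_sat
    return credit


def query_outcome(clicks: List[Click]) -> int:
    """Per-query interleaving outcome: +1 if ranker A's results earn more
    credit, -1 if ranker B's do, 0 on a tie."""
    credit_a = sum(click_credit(c) for c in clicks if c.team == "A")
    credit_b = sum(click_credit(c) for c in clicks if c.team == "B")
    return (credit_a > credit_b) - (credit_a < credit_b)


if __name__ == "__main__":
    # One interleaved query: a quick-back click on A's result and a
    # long-dwell click on B's result; B wins under this credit function.
    clicks = [Click("A", 5.0), Click("B", 45.0)]
    print(query_outcome(clicks))  # -1
```

Aggregating such per-query outcomes over many interleaved queries yields an experiment-level preference between the two rankers, which can then be compared against the direction of the corresponding AB metric.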
