Generalized Team Draft Interleaving

Interleaving is an online evaluation method that compares two ranking functions by mixing their results and interpreting the users' click feedback. An important property of an interleaving method is its sensitivity, i.e., its ability to obtain reliable comparison outcomes from few user interactions. Several methods have been proposed to improve interleaving sensitivity; they can be roughly divided into two groups: (a) methods that optimize the credit assignment function (how the click feedback is interpreted), and (b) methods that achieve higher sensitivity by controlling the interleaving policy (how often a particular interleaved result page is shown). In this paper, we propose an interleaving framework that generalizes previously studied interleaving methods in two respects. First, it achieves higher sensitivity by performing a joint, data-driven optimization of the credit assignment function and the interleaving policy. Second, the framework is formulated to be general with respect to the search domain in which the interleaving experiment is deployed, so that it can be applied in domains with grid-based result presentation, such as image search. To simplify the optimization, we additionally introduce a stratified estimate of the experiment outcome. This stratification is also useful on its own, as it reduces the variance of the outcome and thus increases the interleaving sensitivity. We perform an extensive experimental study using large-scale document and image search datasets obtained from a commercial search engine. The experiments show that the proposed framework achieves marked improvements in sensitivity over effective baselines on both datasets.
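For readers unfamiliar with the baseline being generalized, the following minimal Python sketch illustrates classic Team Draft interleaving together with a simple rule-based credit assignment (one unit of credit per click, given to the team that contributed the clicked result). It is an illustrative assumption-laden sketch, not the paper's jointly optimized credit function or interleaving policy; the function names, the binary credit rule, and the example document ids are all hypothetical.

    import random


    def team_draft_interleave(ranking_a, ranking_b, length=10):
        """Team Draft: rankers A and B alternately 'pick' their highest-ranked
        result not yet placed; a coin flip decides who picks first each round."""
        interleaved = []      # mixed result list shown to the user
        team = []             # team[i] records which ranker contributed interleaved[i]
        seen = set()
        ia = ib = 0           # cursors into ranking_a / ranking_b

        while len(interleaved) < length and (ia < len(ranking_a) or ib < len(ranking_b)):
            order = ['A', 'B'] if random.random() < 0.5 else ['B', 'A']
            for owner in order:
                if len(interleaved) >= length:
                    break
                ranking = ranking_a if owner == 'A' else ranking_b
                idx = ia if owner == 'A' else ib
                # Skip anything already placed by the other team.
                while idx < len(ranking) and ranking[idx] in seen:
                    idx += 1
                if idx < len(ranking):
                    interleaved.append(ranking[idx])
                    team.append(owner)
                    seen.add(ranking[idx])
                    idx += 1
                # Store the advanced cursor back.
                if owner == 'A':
                    ia = idx
                else:
                    ib = idx
        return interleaved, team


    def credit_outcome(team, clicked_positions):
        """Binary credit assignment: each click gives +1 to the team that
        contributed the clicked result. Positive favours A, negative favours B."""
        credit_a = sum(1 for p in clicked_positions if team[p] == 'A')
        credit_b = sum(1 for p in clicked_positions if team[p] == 'B')
        return credit_a - credit_b


    # Hypothetical usage for one query impression:
    a = ["d1", "d2", "d3", "d4"]
    b = ["d3", "d1", "d5", "d6"]
    mixed, team = team_draft_interleave(a, b, length=4)
    outcome = credit_outcome(team, clicked_positions=[0, 2])  # user clicked ranks 1 and 3

In the generalized framework described in the abstract, the fixed coin-flip policy and the unit per-click credits of this sketch are replaced by a policy and a credit function that are jointly learned from historical data, and the experiment outcome is estimated with the stratification mentioned above rather than by simply averaging per-impression scores.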
