A Comparative Analysis of Interleaving Methods for Aggregated Search

A result page of a modern search engine often goes beyond a simple list of “10 blue links.” Many specific user needs (e.g., News, Image, Video) are addressed by so-called aggregated or vertical search solutions: specially presented documents, often retrieved from dedicated sources, that stand out from the regular organic Web search results. Such complex result layouts raise their own challenges when it comes to evaluating ranking systems. This is especially true for interleaving methods, which have become an important type of online evaluation: by mixing results from two different result pages, interleaving can easily break the desired layout in which vertical documents are grouped together, and hence hurt the user experience. We conduct an analysis of different interleaving methods as applied to aggregated search engine result pages. Apart from conventional interleaving methods, we propose two vertical-aware methods: one derived from the widely used Team-Draft Interleaving method by adjusting it to respect vertical document groupings, and another based on the recently introduced Optimized Interleaving framework. We show that our proposed methods are better at preserving the user experience than existing interleaving methods, while still performing well as a tool for comparing ranking systems. To evaluate the proposed vertical-aware interleaving methods, we use real-world click data as well as simulated clicks and simulated ranking systems.
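As background, the following is a minimal Python sketch of standard Team-Draft Interleaving, the method the first vertical-aware variant adjusts. The function names and the simple click-credit rule are illustrative assumptions rather than the paper's implementation, and the vertical-grouping constraint itself is not reproduced here.

import random

def team_draft_interleave(ranking_a, ranking_b, length=10):
    """Merge two rankings into one list, remembering which 'team' contributed each document."""
    a, b = list(ranking_a), list(ranking_b)
    interleaved, team_a, team_b = [], set(), set()
    while len(interleaved) < length and (a or b):
        # The team with fewer picks so far goes next; ties are broken by a coin flip.
        a_turn = len(team_a) < len(team_b) or (
            len(team_a) == len(team_b) and random.random() < 0.5)
        ranking, team = (a, team_a) if (a_turn and a) or not b else (b, team_b)
        # The chosen team contributes its highest-ranked document not yet shown.
        while ranking and ranking[0] in interleaved:
            ranking.pop(0)
        if ranking:
            doc = ranking.pop(0)
            interleaved.append(doc)
            team.add(doc)
    return interleaved, team_a, team_b

def infer_preference(clicked_docs, team_a, team_b):
    """Attribute clicks to teams; the team whose documents attract more clicks wins the comparison."""
    credit_a = len(set(clicked_docs) & team_a)
    credit_b = len(set(clicked_docs) & team_b)
    return 'A' if credit_a > credit_b else 'B' if credit_b > credit_a else 'tie'

A vertical-aware variant would additionally constrain the picking step so that documents from the same vertical (e.g., News or Image results) remain adjacent in the interleaved list, which is exactly the layout property that plain interleaving tends to break.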
