Federated search in the wild: the combined power of over a hundred search engines

Federated search has the potential to improve web search: the user becomes less dependent on a single search provider, and parts of the deep web become available through a unified interface, leading to a wider variety of retrieved search results. However, a publicly available dataset for federated search that reflects an actual web environment has been lacking. As a result, it has been difficult to assess whether proposed systems are suitable for the web setting. We introduce a new test collection containing the results from more than a hundred actual search engines, ranging from large general web search engines such as Google and Bing to small domain-specific engines. We discuss the design of the collection and analyze the effect of several sampling methods. For a set of test queries, we collected relevance judgements for the top 10 results of each search engine. The dataset is publicly available and is useful for researchers interested in resource selection for web search collections, result merging, and size estimation of uncooperative resources.
