From federated to aggregated search

Federated search refers to the brokered retrieval of content from a set of auxiliary retrieval systems rather than from a single, centralized retrieval system. Federated search tasks occur in, for example, digital libraries (where documents from several retrieval systems must be seamlessly merged) and peer-to-peer information retrieval (where documents distributed across a network of local indexes must be retrieved).

In the context of web search, aggregated search refers to the integration of non-web content (e.g., images, videos, news articles, maps, tweets) into a web search result page. This contrasts with classic web search, where users are presented with a ranked list consisting exclusively of general web documents. As in other federated search situations, the non-web content is often retrieved from auxiliary retrieval systems (e.g., image or video databases, news indexes). Although aggregated search can be seen as an instance of federated search, several aspects make it a unique and compelling research topic. These include large sources of evidence (e.g., click logs) for deciding which non-web items to return, constrained interfaces (e.g., mobile screens), and a very heterogeneous set of available auxiliary resources (e.g., images, videos, maps, news articles). Each of these aspects introduces problems and opportunities not addressed in the federated search literature.

Aggregated search is an important future research direction for information retrieval. All major search engines now provide aggregated search results. As the number of available auxiliary resources grows, deciding how to effectively surface content from each will become increasingly important.

The goal of this tutorial is to provide an overview of federated search and aggregated search techniques for information retrieval researchers at an intermediate level. The content will also be valuable for practitioners in industry. We will take the audience through the most influential work in these areas and describe how it relates to real-world aggregated search systems. We will also list some of the new challenges confronted in aggregated search and discuss directions for future work.
