Federated Search

Federated search (also known as federated information retrieval or distributed information retrieval) is a technique for searching multiple text collections simultaneously. Queries are submitted to the subset of collections most likely to return relevant answers, and the results returned by the selected collections are merged into a single list. Federated search is preferred over centralized search alternatives in many environments. For example, commercial search engines such as Google cannot easily index uncrawlable hidden web collections, whereas federated search systems can search the contents of hidden web collections without crawling. In enterprise environments, where each organization maintains an independent search engine, federated search techniques can provide parallel search over multiple collections. There are three major challenges in federated search. For each query, the subset of collections most likely to return relevant documents must be selected; this is the collection selection problem. To select suitable collections, federated search systems need to acquire some knowledge about the contents of each collection, which creates the collection representation problem. The results returned from the selected collections are merged before the final presentation to the user; this final step is the result merging problem. The goal of this work is to provide a comprehensive summary of previous research on the federated search challenges described above.
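
The three stages described above can be illustrated with a toy sketch. It is not a method from the surveyed literature: all function names and the example data are hypothetical, the representation is a plain term-frequency summary built from the first few documents of each collection (an assumption standing in for whatever sampling a real system would use), and merging uses simple min-max score normalisation as a stand-in for a real result merging algorithm.

```python
from collections import Counter


def represent(docs, sample_size=2):
    """Collection representation: term counts over a small document sample."""
    summary = Counter()
    for doc in docs[:sample_size]:
        summary.update(doc.lower().split())
    return summary


def select(query, summaries, k=2):
    """Collection selection: keep the k collections whose summaries
    contain the query terms most often (ties broken by name)."""
    terms = query.lower().split()
    scores = {name: sum(s[t] for t in terms) for name, s in summaries.items()}
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return [name for name in ranked if scores[name] > 0][:k]


def local_search(query, docs):
    """Stand-in for a collection's own engine: score = matching word count."""
    terms = set(query.lower().split())
    hits = []
    for doc in docs:
        score = sum(1 for word in doc.lower().split() if word in terms)
        if score > 0:
            hits.append((doc, float(score)))
    return hits


def merge(result_lists):
    """Result merging: min-max normalise each list, then sort into one list."""
    merged = []
    for hits in result_lists:
        if not hits:
            continue
        low = min(score for _, score in hits)
        high = max(score for _, score in hits)
        span = high - low
        for doc, score in hits:
            # A list whose scores are all equal is mapped to 1.0.
            merged.append((doc, (score - low) / span if span else 1.0))
    return sorted(merged, key=lambda pair: (-pair[1], pair[0]))


def federated_search(query, collections, k=2):
    summaries = {name: represent(docs) for name, docs in collections.items()}
    selected = select(query, summaries, k)
    results = [local_search(query, collections[name]) for name in selected]
    return selected, merge(results)


# Hypothetical example data.
collections = {
    "news": ["election results announced", "stock market rises"],
    "sports": ["football final results", "tennis open final"],
    "recipes": ["tomato soup recipe", "bread baking guide"],
}
selected, merged = federated_search("final results", collections)
print(selected)
print(merged)
```

In this sketch the broker never contacts the "recipes" collection, because its summary contains none of the query terms. Scores from different collections are not directly comparable, which is why each list is normalised before merging.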
