Query-based sampling of text databases

The proliferation of searchable text databases on corporate networks and the Internet causes a database selection problem for many people. Algorithms such as gGLOSS and CORI can automatically select which text databases to search for a given information need, but only if given a set of resource descriptions that accurately represent the contents of each database. The existing techniques for acquiring resource descriptions have significant limitations when used in wide-area networks controlled by many parties. This paper presents query-based sampling, a new technique for acquiring accurate resource descriptions. Query-based sampling does not require the cooperation of resource providers, nor does it require that resource providers use a particular search engine or representation technique. An extensive set of experimental results demonstrates that accurate resource descriptions are created, that computation and communication costs are reasonable, and that the resource descriptions do in fact enable accurate automatic database selection.
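The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of the general idea, under stated assumptions: the database is reachable solely through an ordinary search interface, one-term queries are issued, a few top-ranked documents are kept per query, and the next query term is drawn at random from the vocabulary learned so far. The names `query_based_sample`, `toy_search`, and `TOY_DOCS`, along with all parameter defaults, are hypothetical and chosen for illustration; the resource description here is simply a table of document frequencies.

```python
import random
from collections import Counter

def query_based_sample(search, seed_term, docs_per_query=4,
                       max_docs=300, max_queries=1000, rng=None):
    """Build a resource description by sampling through a search interface.

    `search(term, n)` is an assumed callable returning up to n
    (doc_id, text) pairs; nothing else is required of the database.
    """
    rng = rng or random.Random(0)
    seen = set()        # ids of documents sampled so far
    df = Counter()      # learned description: term -> document frequency
    tried = set()       # query terms already issued
    term = seed_term
    for _ in range(max_queries):
        if len(seen) >= max_docs:
            break
        tried.add(term)
        for doc_id, text in search(term, docs_per_query):
            if doc_id in seen or len(seen) >= max_docs:
                continue
            seen.add(doc_id)
            # Count each term once per document.
            df.update(set(text.lower().split()))
        # Pick the next query term from the learned vocabulary.
        candidates = sorted(t for t in df if t not in tried)
        if not candidates:
            break
        term = rng.choice(candidates)
    return df, seen

# Hypothetical toy database used only to exercise the sketch.
TOY_DOCS = {
    1: "database selection for distributed retrieval",
    2: "resource descriptions represent database contents",
    3: "sampling documents with simple queries",
    4: "queries retrieve documents from the database",
}

def toy_search(term, n):
    """Return up to n documents containing `term`, in id order."""
    hits = [(i, t) for i, t in sorted(TOY_DOCS.items())
            if term in t.lower().split()]
    return hits[:n]

description, sampled = query_based_sample(toy_search, "database")
```

The sampler touches the database only through `search`, which mirrors the claim that no provider cooperation or particular search engine is needed. A real setting would replace `toy_search` with calls to the remote search service and would likely use a more careful tokenizer and stopping rule.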
