Sample Sizes for Query Probing in Uncooperative Distributed Information Retrieval

The goal of distributed information retrieval is to support effective search across multiple document collections. For efficiency, queries should be routed only to those collections that are likely to contain relevant documents, which requires first obtaining information about the content of the target collections. In an uncooperative environment, query probing, in which randomly chosen queries are used to retrieve a sample of the documents and thus of the lexicon, has been proposed as a technique for estimating statistical term distributions. In this paper we rebut the claim that a sample of 300 documents is sufficient to provide good coverage of collection terms. We propose a novel sampling strategy and experimentally demonstrate that the required sample size varies from collection to collection, that our methods achieve good coverage with variable-sized samples, and that the results of a probe can be used to determine when to stop sampling.
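To make the probing procedure concrete, the following Python sketch illustrates the general shape of query-based sampling with an adaptive stopping rule; it is not the paper's algorithm. The Document and InMemoryCollection classes and the window, min_growth, and max_docs parameters are illustrative assumptions. Later probe terms are drawn from documents already sampled, in the style of query-based sampling (Callan and Connell), and the probe ends once vocabulary growth per newly retrieved document plateaus, rather than at a fixed 300-document cutoff.

    import random
    from dataclasses import dataclass

    @dataclass
    class Document:
        id: int
        terms: set

    class InMemoryCollection:
        """Toy stand-in for an uncooperative search interface: the probe
        can only issue one-term queries and inspect returned documents."""
        def __init__(self, docs):
            self.docs = docs

        def search(self, term, k=4):
            hits = [d for d in self.docs if term in d.terms]
            return hits[:k]

    def query_based_sample(collection, seed_terms, max_docs=5000,
                           window=50, min_growth=1.0):
        """Hypothetical sketch: sample documents via one-term probe
        queries, stopping when the average number of new vocabulary
        terms per newly retrieved document, over the last `window`
        documents, falls below `min_growth`. Parameter values are
        illustrative; max_docs is only a safety ceiling."""
        vocab = set()
        seen_ids = set()
        growth = []                      # new terms contributed per document
        query_pool = list(seed_terms)

        while len(seen_ids) < max_docs and query_pool:
            # pick the next probe term at random from the pool
            term = query_pool.pop(random.randrange(len(query_pool)))
            for doc in collection.search(term):
                if doc.id in seen_ids:
                    continue
                seen_ids.add(doc.id)
                new_terms = doc.terms - vocab
                vocab |= new_terms
                growth.append(len(new_terms))
                # later probe terms come from the sampled documents,
                # so the probe adapts to the collection's own lexicon
                query_pool.extend(new_terms)

            recent = growth[-window:]
            if len(recent) == window and sum(recent) / window < min_growth:
                break                    # vocabulary coverage has plateaued

        return seen_ids, vocab

    # Example: probe a toy collection of 200 overlapping documents.
    docs = [Document(i, {f"t{j}" for j in range(i, i + 20)}) for i in range(200)]
    sampled, vocab = query_based_sample(InMemoryCollection(docs), seed_terms=["t0"])
    print(len(sampled), "documents sampled;", len(vocab), "distinct terms seen")

Because the stopping rule is driven by observed vocabulary growth rather than a fixed document count, the sample size naturally varies with the collection, which is the behaviour the paper argues for.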
