Capturing collection size for distributed non-cooperative retrieval

Modern distributed information retrieval techniques require accurate knowledge of collection size. In non-cooperative environments, where detailed collection statistics are not available, the size of the underlying collections must be estimated. While several approaches for the estimation of collection size have been proposed, their accuracy has not been thoroughly evaluated. An empirical analysis of past estimation approaches across a variety of collections demonstrates that their prediction accuracy is low. Motivated by ecological techniques for the estimation of animal populations, we propose two new approaches for the estimation of collection size. We show that our approaches are significantly more accurate than previous methods, and are more efficient in their use of the resources required to perform the estimation.
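
To make the ecological analogy concrete, the sketch below shows the classical two-sample capture-recapture (Lincoln-Petersen) estimator applied to document identifiers. It is an illustration of the general idea only, not the two approaches proposed here, which the abstract does not specify. The function name `lincoln_petersen` and the use of integer document ids are assumptions made for the example, as is the premise that the two samples are drawn independently and roughly uniformly from the collection.

```python
def lincoln_petersen(first_sample, second_sample):
    """Estimate collection size from two samples of document ids.

    Classical two-sample capture-recapture: if n1 documents are seen in
    the first sample, n2 in the second, and m appear in both, the
    estimated collection size is n1 * n2 / m.
    """
    first = set(first_sample)    # documents "captured" in the first sample
    second = set(second_sample)  # documents "captured" in the second sample
    recaptured = len(first & second)  # documents seen in both samples
    if recaptured == 0:
        # With no overlap the estimator divides by zero, so it is undefined.
        raise ValueError("no overlap between samples; estimate is undefined")
    return len(first) * len(second) / recaptured


# Hypothetical samples: 100 ids each, 50 of which appear in both.
estimate = lincoln_petersen(range(0, 100), range(50, 150))
print(estimate)
```

In a non-cooperative setting the samples would come from documents returned for probe queries, which are generally not uniform random draws, so an estimate of this kind should be read as a rough indication rather than an exact count.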
