Estimation Methods for the Size of Deep Web Textural Data Source: A Survey

Estimating the size of deep web data sources has been an open problem since 1998. This survey reviews all papers and other resources available online that address estimating the size of data sources and that appeared between 1998 and 2008. We first clarify several basic terms that are used in the survey but whose meanings vary in the literature, and then discuss the basic estimation models found in the literature. The survey introduces query-based sampling approaches and reviews methods for estimating both the relative size and the actual size of data sources. Query-based sampling is biased, so the survey also reviews research on overcoming the biases caused by various estimation methods. Finally, future directions for size estimation are discussed.
