Efficient estimation of the size of text deep web data source

This paper addresses the problem of estimating the size of a deep web data source that is accessible only through queries. Since most deep web data sources are non-cooperative, the size of a data source can only be estimated by sending queries and analyzing the returned results. We propose an efficient estimator based on the capture-recapture method. First, we derive an equation relating the overlapping rate to the percentage of the data examined when random samples are available. This equation is conceptually simple, and it leads to the derivation of an estimator for samples obtained by random queries. Because random queries do not produce random documents, it is well known that estimation by random queries consistently results in a negative bias. Starting from the simple estimator for random samples, we adjust the equation so that it can handle the samples returned by random queries. We carried out experiments on several data sources, including the Amazon Ecommerce web service, and compared our method with an approach derived from traditional capture-recapture methods. The results show that our method has a smaller bias and a smaller deviation.
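For orientation, the traditional capture-recapture idea that the proposed estimator is compared against can be sketched as follows. This is a minimal illustration of the classical two-sample Lincoln-Petersen estimate under the assumption of uniformly random samples, not the estimator proposed in the paper, whose equation is not given in the abstract. The function names are hypothetical, and the definition of the overlapping rate used here (total documents returned divided by distinct documents seen) is an assumption.

```python
import random


def lincoln_petersen(first, second):
    """Classical two-sample capture-recapture estimate of population size.

    Assumes both samples are drawn uniformly at random, which the abstract
    notes does not hold for documents returned by random queries.
    """
    first, second = set(first), set(second)
    recaptured = len(first & second)
    if recaptured == 0:
        raise ValueError("no overlap between samples; estimate is undefined")
    return len(first) * len(second) / recaptured


def overlapping_rate(batches):
    """Total documents returned divided by distinct documents seen.

    This definition is an assumption made for illustration only.
    """
    total = sum(len(batch) for batch in batches)
    distinct = len(set().union(*batches))
    if distinct == 0:
        raise ValueError("no documents were returned")
    return total / distinct


if __name__ == "__main__":
    rng = random.Random(0)
    collection = range(10_000)  # hypothetical collection of document ids
    first = rng.sample(collection, 500)
    second = rng.sample(collection, 500)
    print("estimated size:", lincoln_petersen(first, second))
    print("overlapping rate:", overlapping_rate([first, second]))
```

With uniformly random samples the estimate scatters around the true collection size. When the samples come from query results instead, some documents are matched far more often than others, so the overlap is inflated and the estimate falls below the true size, which is the negative bias the abstract describes.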
