Aggregate Estimation Over Dynamic Hidden Web Databases

Many databases on the web are "hidden" behind (i.e., accessible only through) restrictive, form-like search interfaces. Recent studies have shown that aggregate query answers over such hidden web databases can be estimated by issuing a small number of carefully designed search queries through the restrictive web interface. A problem with this existing work, however, is that it assumes the underlying database to be static, while most real-world web databases (e.g., Amazon, eBay) are frequently updated. In this paper, we study the novel problem of estimating and tracking aggregates over dynamic hidden web databases while adhering to the stringent query-cost limitations they enforce (e.g., at most 1,000 search queries per day). Theoretical analysis and extensive real-world experiments demonstrate the effectiveness of our proposed algorithms and their superiority over baseline solutions (e.g., the repeated execution of algorithms designed for static web databases).
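To make the setting concrete, the following is a minimal sketch of the baseline the abstract mentions: re-running a static size estimator from scratch in every round under a fixed query budget. It is not the algorithm proposed in the paper. The estimator is a simple random drill-down over Boolean attributes, and everything named here (`HiddenDB`, `drill_down`, `estimate_size`, the top-k value, the attribute count, the update pattern) is a hypothetical stand-in chosen for illustration. The sketch also assumes that all tuples are distinct, so a fully specified query never overflows.

```python
import random


class HiddenDB:
    """Toy stand-in (hypothetical) for a database behind a top-k search form."""

    def __init__(self, tuples, k):
        self.tuples = set(tuples)  # distinct tuples of 0/1 attribute values
        self.k = k                 # the interface shows at most k tuples
        self.queries_issued = 0

    def search(self, prefix):
        """Conjunctive query fixing the first len(prefix) attributes.

        Returns (number of tuples shown, overflow flag). Like a real form,
        it never reveals how many tuples match an overflowing query.
        """
        self.queries_issued += 1
        matches = sum(1 for t in self.tuples if t[:len(prefix)] == prefix)
        return min(matches, self.k), matches > self.k


def drill_down(db, rng):
    """One random drill-down; returns an unbiased estimate of COUNT(*)."""
    prefix = ()
    while True:
        shown, overflow = db.search(prefix)
        if not overflow:
            # This query is reached with probability 2^-len(prefix),
            # so its result count is weighted by the inverse probability.
            return shown * 2 ** len(prefix)
        prefix += (rng.randint(0, 1),)  # fix the next attribute at random


def estimate_size(db, budget, num_attrs, rng):
    """Average drill-downs without exceeding the per-round query budget."""
    worst_case = num_attrs + 1  # queries one drill-down can need at most
    if budget < worst_case:
        raise ValueError("budget too small for a single drill-down")
    start = db.queries_issued
    estimates = []
    while db.queries_issued - start + worst_case <= budget:
        estimates.append(drill_down(db, rng))
    return sum(estimates) / len(estimates)


# Baseline: repeat the static estimator each round while the data changes.
rng = random.Random(0)
NUM_ATTRS = 12
universe = [tuple((i >> b) & 1 for b in range(NUM_ATTRS))
            for i in range(2 ** NUM_ATTRS)]
db = HiddenDB(rng.sample(universe, 300), k=10)
for day in range(3):
    estimate = estimate_size(db, budget=1000, num_attrs=NUM_ATTRS, rng=rng)
    print(f"day {day}: true size {len(db.tuples)}, estimate {estimate:.0f}")
    db.tuples |= set(rng.sample(universe, 50))  # simulated daily inserts
```

Each round in this sketch discards every query answer from the previous round, which is the inefficiency a tracking approach for dynamic databases would aim to avoid.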
