Aggregate Estimation Over Dynamic Hidden Web Databases

Many databases on the web are "hidden" behind (i.e., accessible only through) their restrictive, form-like search interfaces. Recent studies have shown that it is possible to estimate aggregate query answers over such hidden web databases by issuing a small number of carefully designed search queries through the restrictive web interface. A problem with this existing work, however, is that it assumes the underlying database to be static, while most real-world web databases (e.g., Amazon, eBay) are frequently updated. In this paper, we study the novel problem of estimating and tracking aggregates over dynamic hidden web databases while adhering to the stringent query-cost limitation they enforce (e.g., at most 1,000 search queries per day). Theoretical analysis and extensive real-world experiments demonstrate the effectiveness of our proposed algorithms and their superiority over baseline solutions (e.g., the repeated execution of algorithms designed for static web databases).

Gautam Das | Nan Zhang | Weimo Liu | Saravanan Thirumuruganathan

[10] Nan Zhang, et al. Aggregate estimation over dynamic hidden web databases, 2014, VLDB 2014.
