Leveraging COUNT Information in Sampling Hidden Databases

A large number of online databases are hidden behind form-like interfaces which allow users to execute search queries by specifying selection conditions in the interface. Most of these interfaces return restricted answers (e.g., only top-k of the selected tuples), while many of them also accompany each answer with the COUNT of the selected tuples. In this paper, we propose techniques which leverage the COUNT information to efficiently acquire unbiased samples of the hidden database. We also discuss variants for interfaces which do not provide COUNT information. We conduct extensive experiments to illustrate the efficiency and accuracy of our techniques.
