Optimized Processing of a Batch of Aggregate Queries over Hidden Databases

A tremendous amount of data is concealed behind form-based interfaces, which forward each user query to an underlying data store and return the answer. These interfaces limit the number of retrieved results to the top-k matching tuples, sorted by a proprietary ranking function; the database owner may also restrict the types of queries that are processed. These limitations make it difficult to process a batch of queries. In this paper, we tackle the problem of processing a batch of aggregate queries while sending the minimal number of queries to the hidden database, thereby working within the interface limitations. We propose a novel technique that reuses the results of previously issued queries to answer new aggregate queries at no additional query cost. The proposed method is compared with classical techniques for processing aggregate queries and is evaluated in terms of the relative error of its estimates and its query cost. The results show that our method incurs a lower query cost than the other methods, allowing a batch of queries to be processed at minimal cost.
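
To make the setting concrete, the following is a minimal sketch of a top-k interface with a query-cost counter, plus a client that reuses stored results to answer further aggregates over the same query. It is not the paper's algorithm: the class names `TopKInterface` and `CachingClient`, the overflow flag, and the restriction to conjunctive equality predicates are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Row = Dict[str, object]
# A query is a set of (attribute, value) equality predicates (assumed form).
Query = FrozenSet[Tuple[str, object]]


class TopKInterface:
    """Toy stand-in for a hidden database's form interface (hypothetical)."""

    def __init__(self, rows: List[Row], k: int) -> None:
        self._rows = rows
        self.k = k
        self.cost = 0  # number of queries actually sent to the database

    def search(self, query: Query) -> Tuple[List[Row], bool]:
        """Return at most k matching rows and a flag that is True on overflow."""
        self.cost += 1
        matches = [r for r in self._rows
                   if all(r.get(attr) == val for attr, val in query)]
        return matches[: self.k], len(matches) > self.k


class CachingClient:
    """Answers aggregates from stored results whenever possible (illustrative)."""

    def __init__(self, interface: TopKInterface) -> None:
        self.interface = interface
        self._cache: Dict[Query, Tuple[List[Row], bool]] = {}

    def search(self, query: Query) -> Tuple[List[Row], bool]:
        # Only queries that were never issued before incur a cost.
        if query not in self._cache:
            self._cache[query] = self.interface.search(query)
        return self._cache[query]

    def count(self, query: Query) -> Optional[int]:
        """Exact COUNT if the query does not overflow, otherwise None."""
        rows, overflow = self.search(query)
        return None if overflow else len(rows)

    def total(self, query: Query, attr: str) -> Optional[float]:
        """Exact SUM over a numeric attribute if the query does not overflow."""
        rows, overflow = self.search(query)
        return None if overflow else sum(r[attr] for r in rows)
```

In this sketch, a COUNT followed by a SUM over the same non-overflowing query sends only one query to the interface. An overflowing query returns `None` because the top-k result alone cannot give an exact aggregate; handling that case requires an estimation technique, which this sketch does not attempt.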
