Stratified sampling for data mining on the deep web

In recent years, the deep web has become extremely popular. As with any other data source, mining the deep web can produce important insights or summaries of results. However, data mining on the deep web is challenging because the underlying databases cannot be accessed directly; mining must therefore be performed on samples of the datasets. The samples, in turn, can only be obtained by querying deep web databases with specific inputs. In this paper, we target two related data mining problems: association mining and differential rule mining. Both are intended to extract high-level summaries of the differences in data provided by different deep web data sources in the same domain. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which recursively processes the query space of a deep web data source and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experimental results show that our algorithms effectively and consistently reduce sampling costs compared with a stratified sampling method that considers only estimation error. In addition, compared with simple random sampling, our algorithm achieves higher sampling accuracy and lower sampling costs.
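To illustrate what it means for sample allocation to weigh estimation error against sampling cost, the following minimal sketch uses the classical cost-aware optimum allocation from survey sampling, in which the sample size of a stratum is proportional to N_h * sigma_h / sqrt(c_h). This is a textbook rule, not necessarily the allocation method developed in this paper. The function names, and the assumption that stratum sizes, standard deviations, and per-sample query costs are already known or estimated from a pilot sample, are hypothetical.

```python
import math


def cost_aware_allocation(sizes, stddevs, costs, total_samples):
    """Split a sample budget across strata (textbook optimum allocation).

    sizes[h]   -- number of records in stratum h (assumed known or estimated)
    stddevs[h] -- standard deviation of the target variable in stratum h
    costs[h]   -- cost of drawing one sample from stratum h (e.g. queries issued)

    Returns fractional sample sizes proportional to
    sizes[h] * stddevs[h] / sqrt(costs[h]).
    """
    weights = [n * s / math.sqrt(c) for n, s, c in zip(sizes, stddevs, costs)]
    total_weight = sum(weights)
    return [total_samples * w / total_weight for w in weights]


def stratified_mean(sizes, sample_means):
    """Combine per-stratum sample means into one population-level estimate."""
    population = sum(sizes)
    return sum(n * m for n, m in zip(sizes, sample_means)) / population


# Two equally sized, equally variable strata; the second is four times
# as expensive to sample, so it receives half as many samples.
allocation = cost_aware_allocation(
    sizes=[1000, 1000], stddevs=[0.5, 0.5], costs=[1.0, 4.0], total_samples=90
)

# Estimated support of an itemset from per-stratum sample proportions.
support = stratified_mean(sizes=[1000, 3000], sample_means=[0.2, 0.4])
```

Under this rule, a stratum that is large or highly variable receives more samples, while a stratum that is expensive to query receives fewer. A method that considers only estimation error corresponds to setting all costs equal.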
