Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms

Scalability is a key requirement for any KDD and data mining algorithm, and one of the biggest research challenges is to develop methods that make it possible to use large amounts of data. One approach to dealing with huge amounts of data is to run data mining on a random sample, since for many data mining applications approximate answers are acceptable. However, as several researchers have argued, random sampling is difficult to use because of the difficulty of determining an appropriate sample size. In this paper, we take a sequential sampling approach to this difficulty and propose an adaptive sampling method that solves a general problem covering many actual problems arising in applications of discovery science. An algorithm following this method obtains examples sequentially, in an on-line fashion, and determines from the examples obtained so far whether it has already seen enough of them. Thus, the sample size is not fixed a priori; instead, it adapts to the situation. Owing to this adaptiveness, if we are not in a worst-case situation, as fortunately happens in many practical applications, we can solve the problem with far fewer examples than the worst case requires. We prove the correctness of our method and estimate its efficiency theoretically. To illustrate its usefulness, we consider one concrete task requiring sampling, provide an algorithm based on our method, and demonstrate its efficiency experimentally.
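To give a concrete feel for the sequential approach described above, the following is a minimal sketch, not the paper's exact algorithm: it decides whether the mean of a bounded random source exceeds a threshold, drawing examples one at a time and stopping as soon as a Hoeffding-style confidence bound separates the running mean from the threshold. The function name, the specific bound, and the union-bound correction over stopping times are our assumptions for illustration.

```python
import math


def adaptive_threshold_test(sample, theta, delta, max_n=100_000):
    """Sequentially decide whether the mean of a [0, 1]-valued source
    exceeds theta, with error probability at most delta.

    `sample` is a zero-argument function returning one example per call.
    Returns (decision, number_of_examples_used).
    """
    total = 0.0
    for t in range(1, max_n + 1):
        total += sample()
        mean = total / t
        # Confidence radius at step t; the log(t(t+1)) term is a union
        # bound over all possible stopping times, so the overall error
        # stays below delta even though we check the condition every step.
        eps_t = math.sqrt(math.log(4 * t * (t + 1) / delta) / (2 * t))
        if mean - eps_t > theta:
            return True, t   # confidently above the threshold: stop early
        if mean + eps_t < theta:
            return False, t  # confidently below the threshold: stop early
    # Fallback: budget exhausted without a confident separation.
    return (total / max_n) > theta, max_n
```

The point of the adaptiveness is visible here: when the true mean is far from `theta` (the easy, non-worst-case situation), the confidence interval separates from the threshold after only a handful of examples, whereas a fixed-size sample would have to be dimensioned for the hardest case.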
