Scaling Up a Boosting-Based Learner via Adaptive Sampling

In this paper we present a experimental evaluation of a boosting based learning system and show that can be run efficiently over a large dataset. The system uses as base learner decision stumps, single atribute decision trees with only two terminal nodes. To select the best decision stump at each iteration we use an adaptive sampling method. As a boosting algorithm, we use a modification of AdaBoost that is suitable to be combined with a base learner that does not use all the dataset. We provide experimental evidence that our method is as accurate as the equivalent algorithm that uses all the dataset but much faster.

[1]  Pat Langley,et al.  Static Versus Dynamic Sampling for Data Mining , 1996, KDD.

[2]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[3]  Ron Kohavi,et al.  Supervised and Unsupervised Discretization of Continuous Features , 1995, ICML.

[4]  Tim Oates,et al.  Efficient progressive sampling , 1999, KDD '99.

[5]  G DietterichThomas An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees , 2000 .

[6]  Usama M. Fayyad,et al.  Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning , 1993, IJCAI.

[7]  J. Ross Quinlan,et al.  Bagging, Boosting, and C4.5 , 1996, AAAI/IAAI, Vol. 1.

[8]  Yoav Freund,et al.  Boosting a weak learning algorithm by majority , 1995, COLT '90.

[9]  Osamu Watanabe,et al.  Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms , 1999, Discovery Science.

[10]  Osamu Watanabe From Computational Learning Theory to Discovery Science , 1999, ICALP.

[11]  Osamu Watanabe,et al.  Practical Algorithms for On-line Sampling , 1998, Discovery Science.

[12]  Pedro M. Domingos,et al.  Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier , 1996, ICML.

[13]  Osamu Watanabe,et al.  MadaBoost: A Modification of AdaBoost , 2000, COLT.

[14]  Jeffrey F. Naughton,et al.  Efficient Sampling Strategies for Relational Database Operations , 1993, Theor. Comput. Sci..

[15]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[16]  Jeffrey F. Naughton,et al.  Query Size Estimation by Adaptive Sampling , 1995, J. Comput. Syst. Sci..

[17]  Erwin Kreyszig,et al.  Introductory Mathematical Statistics. , 1970 .

[18]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1997, EuroCOLT.

[19]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .