Learning Ensembles from Bites: A Scalable and Accurate Approach

Bagging and boosting are two popular ensemble methods that typically achieve better accuracy than a single classifier. On massive data sets, however, both techniques are limited because the sheer size of the data becomes a bottleneck. Voting many classifiers built on small subsets of the data ("pasting small votes") is a promising approach to learning from massive data sets, one that can still exploit the power of bagging and boosting. We propose a framework for building hundreds or thousands of such classifiers on small subsets of the data in a distributed environment. Experiments show the approach is fast, accurate, and scalable.
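
The following is a minimal, illustrative sketch of the pasting-small-votes idea described above: train many base classifiers, each on a small random "bite" of the data, and combine their predictions by unweighted majority vote. The choice of scikit-learn's DecisionTreeClassifier as the base learner, the bite size, and the ensemble size are assumptions made for illustration, not details taken from the paper; in the distributed setting proposed here, each bite could be trained independently on a different processor.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    def train_bites(X, y, n_classifiers=100, bite_size=200, seed=0):
        # Train many base classifiers, each on its own small random subset ("bite").
        # Every iteration is independent, so the loop parallelizes trivially
        # across machines in a distributed environment.
        rng = np.random.default_rng(seed)
        models = []
        for _ in range(n_classifiers):
            idx = rng.choice(len(X), size=bite_size, replace=True)
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def majority_vote(models, X):
        # Combine the bite classifiers by unweighted majority vote.
        preds = np.stack([m.predict(X) for m in models])  # shape: (n_classifiers, n_samples)
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)

    if __name__ == "__main__":
        # Toy stand-in for a massive data set (bite size and counts are illustrative).
        X, y = make_classification(n_samples=5000, random_state=0)
        ensemble = train_bites(X, y)
        print("voted training accuracy:", (majority_vote(ensemble, X) == y).mean())

Because each classifier sees only a small bite, individual members are weak but cheap to train; accuracy comes from voting a large number of them, which is what makes the scheme attractive when the full data set cannot be processed by a single learner.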
