Learning Ensembles from Bites: A Scalable and Accurate Approach

Bagging and boosting are two popular ensemble methods that typically achieve better accuracy than a single classifier. On massive data sets, however, both techniques are limited because the sheer size of the data becomes a bottleneck. Voting many classifiers built on small subsets of the data ("pasting small votes") is a promising approach to learning from massive data sets, one that can still exploit the power of bagging and boosting. We propose a framework for building hundreds or thousands of such classifiers on small subsets of the data in a distributed environment. Experiments show the approach is fast, accurate, and scalable.
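
The following is a minimal, illustrative sketch of the pasting-small-votes idea described above: train many base classifiers, each on a small random "bite" of the data, and combine their predictions by unweighted majority vote. The choice of scikit-learn's DecisionTreeClassifier as the base learner, the bite size, and the ensemble size are assumptions made for illustration, not details taken from the paper; in the distributed setting proposed here, each bite could be trained independently on a different processor.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    def train_bites(X, y, n_classifiers=100, bite_size=200, seed=0):
        # Train many base classifiers, each on its own small random subset ("bite").
        # Every iteration is independent, so the loop parallelizes trivially
        # across machines in a distributed environment.
        rng = np.random.default_rng(seed)
        models = []
        for _ in range(n_classifiers):
            idx = rng.choice(len(X), size=bite_size, replace=True)
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def majority_vote(models, X):
        # Combine the bite classifiers by unweighted majority vote.
        preds = np.stack([m.predict(X) for m in models])  # shape: (n_classifiers, n_samples)
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)

    if __name__ == "__main__":
        # Toy stand-in for a massive data set (bite size and counts are illustrative).
        X, y = make_classification(n_samples=5000, random_state=0)
        ensemble = train_bites(X, y)
        print("voted training accuracy:", (majority_vote(ensemble, X) == y).mean())

Because each classifier sees only a small bite, individual members are weak but cheap to train; accuracy comes from voting a large number of them, which is what makes the scheme attractive when the full data set cannot be processed by a single learner.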
