Do we need hundreds of classifiers to solve real world classification problems?

We evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest-neighbors, partial least squares and principal component regression, logistic and multinomial regression, multivariate adaptive regression splines and other methods), implemented in Weka, R (with and without the caret package), C and Matlab, including all the relevant classifiers available today. We use 121 data sets, which represent the whole UCI database (excluding the large-scale problems) plus other real-world problems of our own, in order to draw conclusions about classifier behavior that are statistically significant and do not depend on the particular data set collection. The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy and exceeds 90% in 84.3% of the data sets. However, the difference from the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy, is not statistically significant. A few models are clearly better than the rest: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0 and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package). Random forest is clearly the best family of classifiers (3 of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top 10), neural networks and boosting ensembles (5 and 3 members in the top 20, respectively).
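
To make the evaluation protocol concrete, the sketch below (in R; not the authors' actual benchmarking code) trains two of the top-performing families on a single data set with caret and computes the "percentage of the maximum accuracy" measure used above. The iris data set is a hypothetical stand-in for one of the 121 data sets, and caret's "svmRadial" method (kernlab) is used in place of the LibSVM implementation evaluated in the paper; the randomForest and kernlab packages are assumed to be installed.

    # A minimal sketch of the evaluation protocol, not the paper's code.
    # "rf" wraps the randomForest package; "svmRadial" is kernlab's
    # Gaussian-kernel SVM, standing in for the LibSVM version.
    library(caret)

    set.seed(1)
    data(iris)                              # stand-in for one UCI data set
    idx <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
    trn <- iris[idx, ]
    tst <- iris[-idx, ]

    ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
    rf  <- train(Species ~ ., data = trn, method = "rf",        trControl = ctrl)
    svm <- train(Species ~ ., data = trn, method = "svmRadial", trControl = ctrl)

    # Test-set accuracy of each tuned model
    acc <- sapply(list(rf = rf, svm = svm),
                  function(m) mean(predict(m, tst) == tst$Species))

    # "Percentage of the maximum accuracy" on this data set: each classifier's
    # accuracy divided by the best accuracy any classifier achieved on it
    pct_max <- 100 * acc / max(acc)
    print(round(pct_max, 1))

Averaging this per-data-set ratio over the whole collection gives each classifier's overall score, which is how figures such as the 94.1% (RF) and 92.3% (Gaussian SVM) reported above are obtained.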
