An extensive experimental survey of regression methods

Regression is a highly relevant problem in machine learning, with many different approaches available. The present work compares a large collection of 77 popular regression models belonging to 19 families: linear and generalized linear models, generalized additive models, least squares, projection methods, LASSO and ridge regression, Bayesian models, Gaussian processes, quantile regression, nearest neighbors, regression trees and rules, random forests, bagging and boosting, neural networks, deep learning and support vector regression. These methods are evaluated on all the regression datasets of the UCI machine learning repository (83 datasets), with a few exceptions due to technical reasons. The experiments identify several outstanding regression models: the M5 rule-based model with corrections based on nearest neighbors (cubist), the gradient boosted machine (gbm), the boosting ensemble of regression trees (bstTree) and the M5 regression tree. Cubist achieves the best squared correlation (R2) in 15.7% of the datasets and comes very near to the best in most of the remaining ones: its difference to the best R2 is below 0.2 for 89.1% of the datasets, and the median of these differences over the dataset collection is very low (0.0192), compared e.g. with 0.150 for classical linear regression. However, cubist is slow and fails on several large datasets, whereas other similar regression models such as M5 never fail, and the difference of M5 to the best R2 is below 0.2 for 92.8% of the datasets. Other well-performing regression models are the committee of neural networks (avNNet), extremely randomized regression trees (extraTrees, which achieves the best R2 in 33.7% of the datasets), the random forest (rf) and ε-support vector regression (svr), but they are slower and fail on several datasets. The fastest regression model is least angle regression (lars), which is 70 and 2,115 times faster than M5 and cubist, respectively. The model requiring the least memory is non-negative least squares (nnls), about 2 GB, similar to cubist, while M5 requires about 8 GB. For 97.6% of the datasets there is a regression model among the 10 best that comes very near (difference below 0.1) to the best R2, and this rises to 100% when differences of 0.2 are allowed. Therefore, provided that our dataset and model collections are representative enough, the main conclusion of this study is that, for a new regression problem, some model in our top 10 should achieve an R2 near the best attainable for that problem.
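The figure of merit throughout the abstract is the squared correlation (R2) between model predictions and the true response on each dataset, together with each model's difference to the best R2 achieved on that dataset. The following minimal sketch shows how such a comparison could be reproduced in R with the caret interface used in the study; it assumes the caret, Cubist, gbm and mlbench packages are installed, and uses the BostonHousing data purely as an illustrative dataset (the study itself uses the UCI regression collection and its own tuning and evaluation protocol).

```r
# Minimal sketch, not the paper's exact protocol: assumes the caret,
# Cubist, gbm and mlbench packages; BostonHousing is only illustrative.
library(caret)

data(BostonHousing, package = "mlbench")            # regression data, target: medv
x <- BostonHousing[, setdiff(names(BostonHousing), "medv")]
y <- BostonHousing$medv

ctrl <- trainControl(method = "cv", number = 10)     # 10-fold cross-validation

# Two of the best-performing models in the study: cubist and gbm.
fit_cubist <- train(x, y, method = "cubist", trControl = ctrl)
fit_gbm    <- train(x, y, method = "gbm", trControl = ctrl, verbose = FALSE)

# Cross-validated R2 (caret reports the squared correlation between
# predictions and observations, the measure used in the abstract).
r2_values <- c(
  cubist = getTrainPerf(fit_cubist)$TrainRsquared,
  gbm    = getTrainPerf(fit_gbm)$TrainRsquared
)

# "Difference to the best R2" on this dataset, as reported per dataset.
diff_to_best <- max(r2_values) - r2_values
print(round(cbind(R2 = r2_values, diff_to_best), 4))
```

In the study this difference to the best R2 is computed per dataset across all 77 models and 83 datasets; the sketch above only contrasts two of the top performers on a single dataset.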
