Random forest versus logistic regression: a large-scale benchmark experiment

Background and goal: The Random Forest (RF) algorithm for regression and classification has gained considerable popularity since its introduction in 2001. It has since become a standard classification approach, competing with logistic regression (LR) in many innovation-friendly scientific fields.

Results: In this context, we present a large-scale benchmarking experiment based on 243 real datasets that compares the prediction performance of the original version of RF with default parameters and LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired by clinical trial methodology, thus avoiding common pitfalls and major sources of bias.

Conclusion: RF performed better than LR according to the considered accuracy measure in approximately 69% of the datasets. The mean difference between RF and LR was 0.029 (95%-CI = [0.022, 0.038]) for the accuracy, 0.041 (95%-CI = [0.031, 0.053]) for the Area Under the Curve, and −0.027 (95%-CI = [−0.034, −0.021]) for the Brier score; all three measures thus suggest a significantly better performance of RF (for the Brier score, lower values are better). As a side result of our benchmarking experiment, we observed that the results depended noticeably on the inclusion criteria used to select the example datasets, which emphasizes the importance of clear statements about this dataset selection process. We also stress that neutral studies similar to ours, carefully designed and based on a large number of datasets, will be necessary in the future to evaluate further variants, implementations, or parameters of random forests that may yield improved accuracy compared to the original version with default values.
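The three performance measures named above, and an interval for the mean per-dataset difference, can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: it assumes a percentile bootstrap over datasets (the abstract does not state how the intervals were obtained), and all function names and toy numbers are hypothetical.

```python
import random
from statistics import mean


def accuracy(y_true, p_hat, threshold=0.5):
    """Share of observations whose thresholded probability matches the label."""
    return mean([int((p >= threshold) == bool(y)) for y, p in zip(y_true, p_hat)])


def brier_score(y_true, p_hat):
    """Mean squared distance between predicted probability and 0/1 label."""
    return mean([(p - y) ** 2 for y, p in zip(y_true, p_hat)])


def auc(y_true, p_hat):
    """Probability that a random positive is ranked above a random negative
    (ties count one half). Quadratic in sample size; fine for a sketch."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci_mean(diffs, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-dataset differences.
    ASSUMPTION: datasets are resampled with replacement; the abstract does
    not specify the interval method."""
    rng = random.Random(seed)
    n = len(diffs)
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lower = boot_means[int((1 - level) / 2 * n_boot)]
    upper = boot_means[int((1 + level) / 2 * n_boot) - 1]
    return mean(diffs), lower, upper


# Toy predictions on a single made-up dataset (not data from the study).
y_true = [1, 0, 1, 1, 0]
p_rf = [0.9, 0.2, 0.6, 0.4, 0.1]   # hypothetical RF probabilities
p_lr = [0.7, 0.4, 0.55, 0.3, 0.6]  # hypothetical LR probabilities

acc_diff = accuracy(y_true, p_rf) - accuracy(y_true, p_lr)
auc_diff = auc(y_true, p_rf) - auc(y_true, p_lr)
brier_diff = brier_score(y_true, p_rf) - brier_score(y_true, p_lr)

# Made-up per-dataset accuracy differences (RF minus LR), one per dataset.
toy_diffs = [0.05, -0.01, 0.03, 0.00, 0.04, 0.02, -0.02, 0.06]
mean_diff, ci_low, ci_high = bootstrap_ci_mean(toy_diffs)
```

In a benchmark of this kind, each entry of the difference list would come from a resampling-based performance estimate on one dataset; a positive mean difference in accuracy or AUC, or a negative one in Brier score, favours RF.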
