The parameter sensitivity of random forests

Background: The Random Forest (RF) algorithm for supervised machine learning is an ensemble learning method widely used in science and many other fields. Its popularity has been increasing, but relatively few studies address the parameter selection process, a critical step in model fitting. Because of numerous assertions about the reliable performance of the default parameters, many RF models are fit using these values. However, there has not yet been a thorough examination of the parameter sensitivity of RFs in computational genomic studies. We address this gap here.

Results: We examined the effects of parameter selection on classification performance of the RF machine learning algorithm on two biological datasets with distinct p/n ratios: sequencing summary statistics (low p/n) and microarray-derived data (high p/n). Here, p refers to the number of variables and n to the number of samples. Our findings demonstrate that parameterization is highly correlated with both prediction accuracy and variable importance measures (VIMs). Further, we demonstrate that different parameters are critical for tuning different datasets, and that parameter optimization significantly improves upon the default parameters.

Conclusions: Parameter performance showed wide variability on both low and high p/n data. There is therefore significant benefit to be gained by tuning RFs away from their default parameter settings.
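As a minimal sketch of the kind of tuning the abstract advocates, the R code below grid-searches a few RF parameters (ntree, mtry, nodesize) around the classification defaults of the randomForest package and scores each combination by its out-of-bag (OOB) error. The grid values and the synthetic X/y objects are illustrative assumptions, not the authors' actual protocol or datasets.

```r
## Sketch: tuning RF parameters away from the defaults using OOB error.
## Assumes the randomForest package; X, y and the grid are placeholders.
library(randomForest)

## X: data.frame of p predictor variables; y: factor of class labels.
## Replace with real data (e.g. sequencing summary statistics).
set.seed(42)
X <- as.data.frame(matrix(rnorm(100 * 20), nrow = 100))
y <- factor(sample(c("case", "control"), 100, replace = TRUE))

## Candidate values around the classification defaults
## (ntree = 500, mtry = floor(sqrt(p)), nodesize = 1).
grid <- expand.grid(
  ntree    = c(250, 500, 1000),
  mtry     = c(2, floor(sqrt(ncol(X))), floor(ncol(X) / 2)),
  nodesize = c(1, 5, 10)
)

grid$oob_error <- apply(grid, 1, function(params) {
  fit <- randomForest(x = X, y = y,
                      ntree    = params["ntree"],
                      mtry     = params["mtry"],
                      nodesize = params["nodesize"])
  ## OOB error rate after the final tree of the forest
  fit$err.rate[nrow(fit$err.rate), "OOB"]
})

## Best-performing parameter combination by OOB error
grid[which.min(grid$oob_error), ]
```

OOB error is used here only for brevity; a nested cross-validation or held-out test set, as is typical in genomic benchmarking, would give a less optimistic estimate of the tuned model's performance.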
