Understanding Random Forests: From Theory to Practice

Data analysis and machine learning have become an integral part of modern scientific methodology, offering automated procedures for predicting a phenomenon based on past observations, uncovering underlying patterns in data, and providing insights about the problem. Yet caution is needed: machine learning should not be used as a black-box tool, but rather as a methodology, with a rational thought process that is entirely dependent on the problem under study. In particular, using an algorithm should ideally require a reasonable understanding of its mechanisms, properties, and limitations, in order to better apprehend and interpret its results.

Accordingly, the goal of this thesis is to provide an in-depth analysis of random forests, consistently calling into question each and every part of the algorithm, in order to shed new light on its learning capabilities, inner workings, and interpretability. The first part of this work studies the induction of decision trees and the construction of ensembles of randomized trees, motivating their design and purpose whenever possible. Our contributions follow with an original complexity analysis of random forests, showing their good computational performance and scalability, along with an in-depth discussion of their implementation details, as contributed within Scikit-Learn.

In the second part of this work, we analyse and discuss the interpretability of random forests through the lens of variable importance measures. The core of our contributions rests in the theoretical characterization of the Mean Decrease of Impurity (MDI) variable importance measure, from which we prove and derive some of its properties in the case of multiway totally randomized trees and in asymptotic conditions. As a result of this work, our analysis demonstrates that variable importances [...].
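The abstract mentions two concrete artifacts: the random forest implementation contributed to Scikit-Learn, and the Mean Decrease of Impurity (MDI) variable importance measure. As a minimal illustration of how these surface in practice (a sketch using Scikit-Learn's public API, not code from the thesis itself; the synthetic dataset and parameter values are arbitrary choices for demonstration), one can fit a forest and read its MDI importances from the `feature_importances_` attribute:

```python
# Minimal sketch: fit a random forest with scikit-learn and inspect its
# Mean Decrease of Impurity (MDI) variable importances, exposed as the
# fitted estimator's `feature_importances_` attribute.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, only 3 of which are informative.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# One MDI score per feature; scikit-learn normalizes them to sum to 1.
importances = forest.feature_importances_
print(importances)
```

Higher scores flag the features whose splits most reduced impurity across the trees; the thesis's second part is precisely about characterizing when such scores can, and cannot, be trusted as measures of true relevance.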
