A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data

Background: Regularized regression methods such as principal component regression or partial least squares regression perform well in learning tasks on high-dimensional spectral data, but cannot explicitly eliminate irrelevant features. The random forest classifier with its associated Gini feature importance, on the other hand, allows for explicit feature elimination, but may not be optimally adapted to spectral data due to the topology of its constituent classification trees, which are based on orthogonal splits in feature space.

Results: We propose to combine the best of both approaches, and evaluate the joint use of a feature selection based on recursive feature elimination using the Gini importance of random forests together with regularized classification methods on spectral data sets from medical diagnostics, chemotaxonomy, biomedical analytics, food science, and synthetically modified spectral data. Here, feature selection using the Gini feature importance combined with regularized classification by discriminant partial least squares regression performed as well as or better than filtering according to different univariate statistical tests, or using regression coefficients in a backward feature elimination. It outperformed both the direct application of the random forest classifier and the direct application of the regularized classifiers on the full set of features.

Conclusion: The Gini importance of the random forest provided a superior means of measuring feature relevance on spectral data, but, on an optimal subset of features, the regularized classifiers may be preferable over the random forest classifier, in spite of their limitation to modeling linear dependencies only. A feature selection based on Gini importance may therefore precede a regularized linear classification to identify this optimal subset of features, gaining the double benefit of dimensionality reduction and elimination of noise from the classification task.
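The pipeline described in the abstract (rank spectral channels by random-forest Gini importance, recursively eliminate the least important ones, then classify on the surviving subset with discriminant PLS) can be sketched as follows. This is a minimal illustration only, assuming scikit-learn as a stand-in toolkit; the helper names gini_rfe and plsda_predict, the elimination fraction, the number of trees and PLS components, and the 0.5 decision threshold are illustrative choices, not the paper's protocol.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.ensemble import RandomForestClassifier


def gini_rfe(X, y, n_keep=50, drop_frac=0.2, random_state=0):
    """Recursive feature elimination ranked by random-forest Gini importance.

    Repeatedly fits a random forest on the surviving features and discards
    the fraction `drop_frac` with the lowest Gini (mean decrease in impurity)
    importance, until only `n_keep` features remain.
    """
    keep = np.arange(X.shape[1])
    while keep.size > n_keep:
        rf = RandomForestClassifier(n_estimators=500, random_state=random_state)
        rf.fit(X[:, keep], y)
        order = np.argsort(rf.feature_importances_)  # ascending importance
        n_drop = max(1, min(int(drop_frac * keep.size), keep.size - n_keep))
        keep = keep[order[n_drop:]]                  # drop the least important
    return keep


def plsda_predict(X_train, y_train, X_test, n_components=5):
    """Discriminant PLS (PLS-DA) for a binary task: regress a 0/1 class
    indicator with PLS and threshold the continuous prediction at 0.5."""
    pls = PLSRegression(n_components=n_components)
    pls.fit(X_train, y_train.astype(float))
    return (pls.predict(X_test).ravel() > 0.5).astype(int)


# Example on synthetic "spectra": 200 samples x 500 channels, of which
# only the first 10 channels carry class-dependent signal.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 500))
X[:, :10] += y[:, None]

selected = gini_rfe(X[:100], y[:100], n_keep=20)
y_hat = plsda_predict(X[:100][:, selected], y[:100], X[100:][:, selected])
print("selected channels:", np.sort(selected))
print("test accuracy:", (y_hat == y[100:]).mean())
```

Note that the feature selection is fit on the training split only, so the selected subset does not leak information from the test spectra into the PLS-DA model.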
