Random forest: A reliable tool for patient response prediction

The goal of classification is to reliably identify instances that belong to the class of interest, a task that is especially important when predicting patient response to drugs. With high-dimensional datasets, however, classification is both complicated and improved by feature selection. Designing a classification experiment therefore requires a number of decisions that must be made well to maximize performance, and these decisions are especially difficult for researchers in fields where data mining is not the focus, such as patient response prediction. Such researchers would benefit from a learner that minimizes the impact of these decisions, either by effectively making the choices for them or by reducing the consequences of any particular choice. We propose that Random Forest, a popular ensemble learner, can serve this role. We performed an experiment involving nineteen feature selection rankers (eleven of which were proposed and implemented by our research team) to thoroughly test the Random Forest learner alongside five other learners. Our results show that, as long as a sufficiently large number of features is used, Random Forest performs well regardless of the choice of feature selection strategy, making it a suitable choice for patient response prediction researchers who do not wish to choose from among a myriad of feature selection approaches.
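To make the experimental setup concrete, the following is a minimal sketch (not the authors' code) of the kind of pipeline the study evaluates: rank features with a filter-based ranker, keep the top k, and train a Random Forest on the reduced feature set. It assumes Python with scikit-learn, uses a synthetic high-dimensional dataset as a stand-in for patient response data, and uses the ANOVA F-score as a placeholder for any one of the nineteen rankers compared in the paper.

# Minimal sketch (assumed scikit-learn pipeline, not the authors' code):
# rank features with a univariate filter, keep the top k, train Random Forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a high-dimensional patient response dataset:
# 200 samples, 2000 features, few of them informative.
X, y = make_classification(n_samples=200, n_features=2000,
                           n_informative=50, random_state=0)

# Filter-based ranking (ANOVA F-score stands in for any ranker),
# keep the top 100 features, then fit a Random Forest on the reduced set.
pipeline = make_pipeline(
    SelectKBest(score_func=f_classif, k=100),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

# Cross-validated AUC for this ranker/learner pairing.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Mean AUC: {scores.mean():.3f}")

Repeating this with different score_func choices and values of k mirrors the study's comparison; the paper's finding is that the Random Forest results remain favorable across rankers once k is large enough.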
