Random forest: A reliable tool for patient response prediction

The goal of classification is to reliably identify instances that belong to the class of interest, a task that is especially important when predicting patient response to drugs. With high-dimensional datasets, however, classification is both complicated and improved by feature selection. Designing a classification experiment therefore requires a number of decisions that must be made well to maximize performance, and these decisions are especially difficult for researchers in fields where data mining is not the focus, such as patient response prediction. Such researchers would benefit from a learner that minimizes the impact of these decisions, either by effectively making the choices for them or by reducing the consequences of any particular choice. We propose that Random Forest, a popular ensemble learner, can serve this role. We performed an experiment involving nineteen feature selection rankers (eleven of which were proposed and implemented by our research team) to thoroughly test the Random Forest learner alongside five other learners. Our results show that, as long as a sufficiently large number of features is used, Random Forest performs well regardless of the choice of feature selection strategy, making it a suitable choice for patient response prediction researchers who do not wish to choose from among a myriad of feature selection approaches.
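To make the experimental setup concrete, the following is a minimal sketch (not the authors' code) of the kind of pipeline the study evaluates: rank features with a filter-based ranker, keep the top k, and train a Random Forest on the reduced feature set. It assumes Python with scikit-learn, uses a synthetic high-dimensional dataset as a stand-in for patient response data, and uses the ANOVA F-score as a placeholder for any one of the nineteen rankers compared in the paper.

# Minimal sketch (assumed scikit-learn pipeline, not the authors' code):
# rank features with a univariate filter, keep the top k, train Random Forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a high-dimensional patient response dataset:
# 200 samples, 2000 features, few of them informative.
X, y = make_classification(n_samples=200, n_features=2000,
                           n_informative=50, random_state=0)

# Filter-based ranking (ANOVA F-score stands in for any ranker),
# keep the top 100 features, then fit a Random Forest on the reduced set.
pipeline = make_pipeline(
    SelectKBest(score_func=f_classif, k=100),
    RandomForestClassifier(n_estimators=100, random_state=0),
)

# Cross-validated AUC for this ranker/learner pairing.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Mean AUC: {scores.mean():.3f}")

Repeating this with different score_func choices and values of k mirrors the study's comparison; the paper's finding is that the Random Forest results remain favorable across rankers once k is large enough.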
