Enhanced Cancer Recognition System Based on Random Forests Feature Elimination Algorithm

Accurate classifiers are vital to the design of precise computer-aided diagnosis (CADx) systems. The classification performance of machine learning algorithms is sensitive to the characteristics of the data, so determining the relevant and discriminative features is a key step in improving CADx performance. Various feature selection methods exist in the literature, but no universal variable selection algorithm performs well in every data analysis scheme. Random Forests (RF), an ensemble of decision trees, has been used successfully in classification studies, and this success makes it a suitable kernel for a wrapper feature subset evaluator. We used a best-first search RF wrapper algorithm to select optimal features from four medical datasets: colon cancer, leukemia, breast cancer, and lung cancer. For each dataset, we compared the accuracies of 15 widely used classifiers trained with all features against those trained with the selected features. The experimental results demonstrate the effectiveness of the proposed feature selection strategy: classification accuracy increased for most of the algorithms.
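To make the wrapper idea concrete, the following is a minimal sketch of best-first wrapper subset selection, not the implementation used in this study. It rests on several assumptions: the function names (`loo_accuracy`, `best_first_wrapper`) and the `patience` stopping parameter are hypothetical, the toy data are invented for illustration, and a leave-one-out nearest-centroid classifier stands in for the RF evaluator so that the example runs with the Python standard library alone. In practice, the `evaluate` callable would be replaced by a cross-validated RF accuracy estimate.

```python
import heapq
from itertools import count


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on the given columns.

    Stand-in evaluator (assumption): the study uses Random Forests here.
    """
    if not features:
        return 0.0
    correct = 0
    for i in range(len(X)):
        # Build per-class centroids from every sample except sample i.
        sums, counts = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(features))
            for k, f in enumerate(features):
                acc[k] += row[f]
            counts[label] = counts.get(label, 0) + 1

        def dist(label):
            # Squared Euclidean distance from sample i to the class centroid.
            return sum((X[i][f] - sums[label][k] / counts[label]) ** 2
                       for k, f in enumerate(features))

        predicted = min(sorted(sums), key=dist)
        correct += predicted == y[i]
    return correct / len(X)


def best_first_wrapper(X, y, evaluate, patience=5):
    """Forward best-first search over feature subsets, scored by `evaluate`.

    Stops after `patience` consecutive expansions that fail to improve the
    best score, or when no unexpanded subsets remain.
    """
    n_features = len(X[0])
    tie = count()  # unique tie-breaker so subsets are never compared directly
    start = frozenset()
    open_heap = [(-0.0, next(tie), start)]
    visited = {start}
    best_subset, best_score = start, evaluate(X, y, [])
    stale = 0
    while open_heap and stale < patience:
        # Expand the most promising subset found so far.
        _, _, subset = heapq.heappop(open_heap)
        improved = False
        for f in range(n_features):
            child = subset | {f}
            if child in visited:
                continue
            visited.add(child)
            score = evaluate(X, y, sorted(child))
            heapq.heappush(open_heap, (-score, next(tie), child))
            if score > best_score:
                best_subset, best_score = child, score
                improved = True
        stale = 0 if improved else stale + 1
    return sorted(best_subset), best_score


# Invented toy data: column 0 separates the classes, columns 1 and 2 are noise.
X = [
    [0.1, 5.0, 3.0], [0.2, 1.0, 9.0], [0.0, 8.0, 1.0], [0.3, 2.0, 7.0],
    [0.9, 4.0, 2.0], [1.0, 9.0, 8.0], [0.8, 1.5, 4.0], [1.1, 6.0, 6.0],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

selected, score = best_first_wrapper(X, y, loo_accuracy)
all_score = loo_accuracy(X, y, list(range(len(X[0]))))
print("selected features:", selected, "accuracy:", score)
print("all features accuracy:", all_score)
```

The final two lines mirror the comparison described above: the same evaluator is applied once to the selected subset and once to the full feature set. Because the search only replaces the incumbent subset on a strict improvement, ties are resolved in favor of the subset found first, which tends to be the smaller one.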
