A novel feature selection approach for biomedical data classification

This paper presents a novel feature selection approach to deal with issues of high dimensionality in biomedical data classification. Extensive research has been performed in the field of pattern recognition and machine learning. Dozens of feature selection methods have been developed in the literature, which can be classified into three main categories: filter, wrapper and hybrid approaches. Filter methods apply an independent test without involving any learning algorithm, while wrapper methods require a predetermined learning algorithm for feature subset evaluation. Filter and wrapper methods have their, respectively, drawbacks and are complementary to each other in that filter approaches have low computational cost with insufficient reliability in classification while wrapper methods tend to have superior classification accuracy but require great computational power. The approach proposed in this paper integrates filter and wrapper methods into a sequential search procedure with the aim to improve the classification performance of the features selected. The proposed approach is featured by (1) adding a pre-selection step to improve the effectiveness in searching the feature subsets with improved classification performances and (2) using Receiver Operating Characteristics (ROC) curves to characterize the performance of individual features and feature subsets in the classification. Compared with the conventional Sequential Forward Floating Search (SFFS), which has been considered as one of the best feature selection methods in the literature, experimental results demonstrate that (i) the proposed approach is able to select feature subsets with better classification performance than the SFFS method and (ii) the integrated feature pre-selection mechanism, by means of a new selection criterion and filter method, helps to solve the over-fitting problems and reduces the chances of getting a local optimal solution.

[1]  Pavel Pudil,et al.  Flexible-Hybrid Sequential Floating Search in Statistical Feature Selection , 2006, SSPR/SPR.

[2]  Jack Perkins,et al.  Pattern recognition in practice , 1980 .

[3]  D Haussler,et al.  Knowledge-based analysis of microarray gene expression data by using support vector machines. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[4]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[5]  Pavel Paclík,et al.  Adaptive floating search methods in feature selection , 1999, Pattern Recognit. Lett..

[6]  Huan Liu,et al.  Feature Selection for Classification , 1997, Intell. Data Anal..

[7]  Pedro Larrañaga,et al.  Filter versus wrapper gene selection approaches in DNA microarray domains , 2004, Artif. Intell. Medicine.

[8]  A. Wayne Whitney,et al.  A Direct Method of Nonparametric Measurement Selection , 1971, IEEE Transactions on Computers.

[9]  B. Reiser,et al.  Estimation of the area under the ROC curve , 2002, Statistics in medicine.

[10]  J A Swets,et al.  Measuring the accuracy of diagnostic systems. , 1988, Science.

[11]  Josef Kittler,et al.  Floating search methods in feature selection , 1994, Pattern Recognit. Lett..

[12]  Pedro Larrañaga,et al.  A review of feature selection techniques in bioinformatics , 2007, Bioinform..

[13]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..

[14]  J. Hanley,et al.  The meaning and use of the area under a receiver operating characteristic (ROC) curve. , 1982, Radiology.

[15]  Francesc J. Ferri,et al.  Comparative study of techniques for large-scale feature selection* *This work was suported by a SERC grant GR/E 97549. The first author was also supported by a FPI grant from the Spanish MEC, PF92 73546684 , 1994 .

[16]  Byung Ro Moon,et al.  Hybrid Genetic Algorithms for Feature Selection , 2004, IEEE Trans. Pattern Anal. Mach. Intell..

[17]  Mineichi Kudo,et al.  Comparison of algorithms that select features for pattern classifiers , 2000, Pattern Recognit..

[18]  W. N. Street,et al.  Machine learning techniques to diagnose breast cancer from image-processed nuclear features of fine needle aspirates. , 1994, Cancer letters.

[19]  Thomas Marill,et al.  On the effectiveness of receptors in recognition systems , 1963, IEEE Trans. Inf. Theory.

[20]  ROSA BLANCO,et al.  Gene Selection For Cancer Classification Using Wrapper Approaches , 2004, Int. J. Pattern Recognit. Artif. Intell..

[21]  Xuegong Zhang,et al.  Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data , 2006, BMC Bioinformatics.

[22]  Yonghong Peng,et al.  A novel ensemble machine learning for robust microarray data classification , 2006, Comput. Biol. Medicine.

[23]  Sophie Lambert-Lacroix,et al.  Effective dimension reduction methods for tumor classification using gene expression data , 2003, Bioinform..

[24]  Huan Liu,et al.  Toward integrating feature selection algorithms for classification and clustering , 2005, IEEE Transactions on Knowledge and Data Engineering.

[25]  Mykola Pechenizkiy,et al.  Local Dimensionality Reduction and Supervised Learning Within Natural Clusters for Biomedical Data Analysis , 2006, IEEE Transactions on Information Technology in Biomedicine.

[26]  N. Obuchowski Receiver operating characteristic curves and their use in radiology. , 2003, Radiology.

[27]  Claudio Cobelli,et al.  Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data , 2005, Bioinform..

[28]  Hiroshi Motoda,et al.  Feature Extraction, Construction and Selection: A Data Mining Perspective , 1998 .

[29]  Yonghong Peng,et al.  Effective features based on normal linear structures for detecting microcalcifications in mammograms , 2008, 2008 19th International Conference on Pattern Recognition.

[30]  Lukasz A. Kurgan,et al.  Knowledge discovery approach to automated cardiac SPECT diagnosis , 2001, Artif. Intell. Medicine.

[31]  Alireza Osareh,et al.  Machine learning techniques to diagnose breast cancer , 2010, 2010 5th International Symposium on Health Informatics and Bioinformatics.

[32]  P. Conilione,et al.  A Comparative Study on Feature Selection for E . coli Promoter Recognition A Comparative Study on Feature Selection for E . coli Promoter Recognition , 2006 .

[33]  A. Asuncion,et al.  UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences , 2007 .

[34]  Anil K. Jain,et al.  Feature Selection: Evaluation, Application, and Small Sample Performance , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[35]  Bernard De Baets,et al.  Feature subset selection for splice site prediction , 2002, ECCB.

[36]  Hiroshi Motoda,et al.  Feature Selection for Knowledge Discovery and Data Mining , 1998, The Springer International Series in Engineering and Computer Science.

[37]  Lucila Ohno-Machado,et al.  The use of receiver operating characteristic curves in biomedical informatics , 2005, J. Biomed. Informatics.

[38]  W. N. Street,et al.  Image analysis and machine learning applied to breast cancer diagnosis and prognosis. , 1995, Analytical and quantitative cytology and histology.

[39]  Igor Kononenko,et al.  Inductive and Bayesian learning in medical diagnosis , 1993, Appl. Artif. Intell..

[40]  M. Zweig,et al.  Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. , 1993, Clinical chemistry.

[41]  Gengfeng Wu,et al.  Dimension reduction with redundant gene elimination for tumor classification , 2008, BMC Bioinformatics.