A method of dual-process sample selection for feature selection on gene expression data

This paper proposes a dual-process sample selection method based on the support vector machine (SVM) for selecting informative features. The samples in a training set are used to train an SVM model, and the samples that are not support vectors are then used to select critical features during recursive feature elimination (RFE). The effect of the dual-process sample selection method on feature selection is evaluated by the classification and clustering performance of the selected features. The method is applied to five gene expression datasets, and the experimental results show that it improves the performance of the feature selection method based on the fuzzy interactive self-organizing data algorithm (ISODATA). This indicates that the method is reliable and effective for selecting informative genes from gene expression data.

Key words: Feature selection, support vector machine, fuzzy interactive self-organizing data algorithm (ISODATA), dual-process sample selection.
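The sample-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear SVM fitted by plain subgradient descent on the hinge loss, treats samples on or inside the margin as support vectors, and ranks features in RFE by squared SVM weight. The fuzzy ISODATA-based selector evaluated in the paper is not reproduced. All function names, parameter values, and the toy data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Fit (w, b) by batch subgradient descent on the L2-regularised hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1.0:  # sample violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def non_support_indices(X, y, w, b, tol=0.05):
    """Indices of samples strictly outside the margin (assumed non-support vectors)."""
    return [i for i, (xi, yi) in enumerate(zip(X, y))
            if yi * (dot(w, xi) + b) > 1.0 + tol]

def rfe(X, y, n_keep):
    """Recursive feature elimination: drop the feature with the smallest w^2 each round."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(X_active, y)
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        del active[worst]
    return active

def select_features(X, y, n_keep):
    """Train an SVM, set support vectors aside, then run RFE on the remaining samples."""
    w, b = train_linear_svm(X, y)
    kept = non_support_indices(X, y, w, b)
    if not kept:  # assumption: fall back to all samples if none remain
        kept = list(range(len(X)))
    X_kept = [X[i] for i in kept]
    y_kept = [y[i] for i in kept]
    return rfe(X_kept, y_kept, n_keep), kept

# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 are noise.
X_toy = [[2.0, 0.1, -0.2], [1.5, -0.3, 0.1], [3.0, 0.2, 0.3], [0.5, 0.0, -0.1],
         [-2.0, 0.2, 0.1], [-1.5, -0.1, -0.3], [-3.0, -0.2, 0.2], [-0.5, 0.1, 0.0]]
y_toy = [1, 1, 1, 1, -1, -1, -1, -1]

selected, kept = select_features(X_toy, y_toy, n_keep=1)
print(selected)  # [0]
```

On real gene expression data, a library SVM and RFE implementation would normally replace the hand-written trainer. The sketch is only meant to make the order of operations concrete: fit, filter samples, then eliminate features.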
