Fuzzy Preference Based Feature Selection and Semisupervised SVM for Cancer Classification

DNA microarray data now permit scientists to screen thousands of genes simultaneously and determine whether those genes are active or silent in normal and cancerous tissues. As microarray technology advances, new analytical methods are needed to determine whether microarray data contain gene expression signatures that discriminate between normal and cancerous tissues. In this paper, we propose a prediction scheme that combines the fuzzy preference based rough set (FPRS) method for feature (gene) selection with semisupervised SVMs. To show the effectiveness of the proposed approach, we compare its performance with that of the signal-to-noise ratio (SNR) and consistency based feature selection (CBFS) methods. Using six benchmark gene microarray datasets (covering both binary and multi-class classification problems), we demonstrate experimentally that the proposed scheme achieves significant empirical success and is biologically relevant for cancer diagnosis and drug discovery.
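As an illustration of the SNR baseline mentioned above, the following is a minimal sketch of signal-to-noise gene ranking for a two-class problem. It assumes the commonly used definition, (mean of class 1 minus mean of class 0) divided by the sum of the two class standard deviations; the abstract does not state the exact formula used in the paper, so this definition, the function names `snr_scores` and `top_genes`, and the toy data are all assumptions made for illustration.

```python
from statistics import mean, stdev


def snr_scores(samples, labels):
    """Per-gene signal-to-noise ratio for a two-class problem.

    samples: list of expression profiles, one list of floats per sample.
    labels:  list of 0/1 class labels, one per sample.
    Assumed definition: (mean_1 - mean_0) / (std_1 + std_0).
    Each class needs at least two samples for the standard deviation.
    """
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s in pos]
        b = [s[g] for s in neg]
        denom = stdev(a) + stdev(b)
        # A gene that is constant within both classes gets score 0.0
        # instead of raising a division-by-zero error.
        scores.append((mean(a) - mean(b)) / denom if denom > 0 else 0.0)
    return scores


def top_genes(samples, labels, k):
    """Indices of the k genes with the largest absolute SNR."""
    scores = snr_scores(samples, labels)
    order = sorted(range(len(scores)), key=lambda g: abs(scores[g]), reverse=True)
    return order[:k]


# Toy data (hypothetical): 4 samples, 2 genes; gene 0 separates the classes.
toy_samples = [[1.0, 5.0], [1.2, 5.1], [3.0, 5.0], [3.2, 4.9]]
toy_labels = [0, 0, 1, 1]
selected = top_genes(toy_samples, toy_labels, k=1)
```

In a pipeline of the kind described, the selected gene indices would be used to reduce each expression profile before a classifier is trained; the FPRS selection step and the semisupervised SVM themselves are not shown here.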
