Gene-Expression-Based Cancer Subtypes Prediction Through Feature Selection and Transductive SVM

With the advancement of microarray technology, gene expression profiling has shown great potential for outcome prediction in different types of cancer. Microarray cancer data, organized as a samples-by-genes matrix, are being exploited to classify tissue samples as benign or malignant, or into their subtypes. They are also useful for identifying potential gene markers for each cancer subtype, which aids diagnosis of a particular cancer type. Nevertheless, small sample size remains a bottleneck in designing suitable classifiers. Traditional supervised classifiers can work only with labeled data, so a large amount of microarray data that lacks adequate follow-up information is disregarded. A novel approach that combines feature (gene) selection with a transductive support vector machine (TSVM) is proposed. We demonstrate that 1) potential gene markers can be identified and 2) TSVMs improve prediction accuracy compared with standard inductive SVMs (ISVMs). A forward greedy search algorithm based on consistency and a statistic called the signal-to-noise ratio are employed to obtain the potential gene markers. The selected genes are then used to design the TSVM. Experimental results confirm the effectiveness of the proposed technique compared with the ISVM and the low-density separation method in semisupervised cancer classification as well as in gene-marker identification.
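
The abstract names the signal-to-noise ratio as a gene-ranking statistic but does not state its formula. A definition commonly used for two-class microarray data is SNR = (mean1 - mean2) / (std1 + std2), computed per gene; the sketch below assumes that definition and may differ from the paper's exact variant. The function names `snr_scores` and `top_genes`, the toy data, and the small constant `eps` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def snr_scores(X, y, eps=1e-12):
    """Per-gene signal-to-noise ratio for a two-class problem.

    X: array of shape (n_samples, n_genes); y: labels in {0, 1}.
    Assumed definition: (mean_1 - mean_0) / (std_1 + std_0).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    mean_diff = pos.mean(axis=0) - neg.mean(axis=0)
    std_sum = pos.std(axis=0, ddof=1) + neg.std(axis=0, ddof=1)
    # eps guards against division by zero for constant genes (an assumption).
    return mean_diff / (std_sum + eps)

def top_genes(X, y, k):
    """Indices of the k genes with the largest absolute SNR."""
    scores = snr_scores(X, y)
    return np.argsort(-np.abs(scores))[:k]

# Toy data: 20 samples, 50 genes, with gene 7 shifted in class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
y = np.array([0] * 10 + [1] * 10)
X[y == 1, 7] += 5.0

scores = snr_scores(X, y)
selected = top_genes(X, y, k=5)
```

Ranking by absolute value keeps genes that are strongly over-expressed in either class. In the pipeline described above, a ranking of this kind would be computed on the labeled samples only, and the selected genes would then feed the TSVM.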
