Multiclass cancer classification by support vector machines with class-wise optimized genes and probability estimates.

We investigate the multiclass classification of cancer microarray samples. Compared with distinguishing two cancer types from gene expression data, classifying more than two cancer types is a relatively hard and less studied problem. We used class-wise optimized genes, each set paired with its own one-versus-all support vector machine (OVA-SVM) classifier, to maximize the utilization of the selected genes. The final prediction was made using the probability scores from all classifiers. We used three different methods of estimating probability from the decision value. Among the three probability methods, Platt's approach was more consistent, whereas the isotonic approach performed better for datasets with unequal proportions of samples in different classes. A probability-based decision not only gives a true and fair comparison between different OVA classifiers but also makes their outputs usable in any post-analysis. As an example of post-analysis, several ensemble experiments combining the three probability methods were implemented to study their effect on classification accuracy. We observe that the ensembles did help improve predictive accuracy on cancer datasets, especially those involving unbalanced samples. A four-fold external stratified cross-validation experiment was performed on six multiclass cancer datasets to obtain unbiased estimates of prediction accuracy. Analysis of the class-wise frequently selected genes on two cancer datasets demonstrated that the approach was able to select important and relevant genes consistent with the literature. This study demonstrates a successful implementation of the framework of class-wise feature selection and multiclass classification for the prediction of cancer subtypes on six datasets.
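The probability-based OVA decision described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fit_platt`, `fit_isotonic`, `predict_class`), the plain gradient-descent fit of the sigmoid, and the toy decision values are all assumptions made for illustration. Only two of the three probability methods are shown, since the abstract names only Platt's and the isotonic approach.

```python
import math

def fit_platt(scores, labels, lr=0.1, iters=2000):
    """Platt-style sigmoid p(s) = 1 / (1 + exp(a*s + b)), fitted here by
    simple gradient descent on the log loss (an illustrative choice)."""
    a, b, n = 0.0, 0.0, len(scores)
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(a * s + b))
            grad_a += (y - p) * s   # d(log loss)/da for this sample
            grad_b += (y - p)       # d(log loss)/db for this sample
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return lambda s: 1.0 / (1.0 + math.exp(a * s + b))

def fit_isotonic(scores, labels):
    """Isotonic calibration via pool-adjacent-violators: a non-decreasing
    step function from decision value to probability."""
    blocks = []  # each block: [sum of labels, count, largest score in block]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge while the previous block's mean is not below the last one's.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   >= blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    steps = [(blk[2], blk[0] / blk[1]) for blk in blocks]

    def predict(s):
        for upper, value in steps:
            if s <= upper:
                return value
        return steps[-1][1]  # scores above the training range
    return predict

def predict_class(decision_values, calibrators):
    """Map each OVA decision value to a probability with that class's
    calibrator, then pick the class with the highest probability."""
    probs = {c: calibrators[c](decision_values[c]) for c in calibrators}
    return max(probs, key=probs.get), probs

# Toy held-out decision values and binary labels for one OVA classifier
# (hypothetical numbers, not taken from the paper).
toy_scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
toy_labels = [0, 0, 1, 0, 1, 1]
platt = fit_platt(toy_scores, toy_labels)
isotonic = fit_isotonic(toy_scores, toy_labels)

# A simple ensemble (assumed form): average the two probability estimates.
ensemble = lambda s: 0.5 * (platt(s) + isotonic(s))
```

Because every classifier's output is mapped onto the same probability scale, the per-class scores become directly comparable, which is what makes the final arg-max decision and post-analysis steps such as the averaged `ensemble` meaningful.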
