Cascading SVMs as a Tool for Medical Diagnosis Using Multi-class Gene Expression Data

In this paper we propose a novel Support Vector Machine (SVM)-based architecture for medical diagnosis using multi-class gene expression data. It consists of a pre-processing unit followed by N-1 sequentially ordered blocks that together classify N classes in a cascading manner. Each block contains both a gene selection module and a classification module, which gives the architecture the flexibility to construct block-specific gene expression spaces and hypersurfaces for discriminating between the different classes. The proposed architecture was applied to medical diagnostic tasks, including prostate and lung cancer diagnosis. Its performance was evaluated using a leave-one-out cross-validation approach that avoids the bias introduced by the gene selection process. The results show that it provides high accuracy, which in most cases exceeds that of the popular one-vs-one and one-vs-all SVM combination schemes and of Nearest-Neighbor classifiers. The cascading SVMs can therefore be successfully applied as a medical diagnostic tool.
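
As an illustration of the cascading idea, the following is a minimal sketch and not the authors' implementation. It assumes that block i separates one class from all classes that come later in a fixed order, that genes are ranked by the absolute Welch t-statistic, and that each classifier is a linear SVM trained by subgradient descent on the hinge loss; the abstract specifies none of these choices. All function names and the toy data are hypothetical, and the pre-processing unit is omitted.

```python
import math
import random


def welch_t(a, b):
    """Welch t-statistic for two samples with possibly unequal variances."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((v - ma) ** 2 for v in a) / max(len(a) - 1, 1)
    vb = sum((v - mb) ** 2 for v in b) / max(len(b) - 1, 1)
    denom = math.sqrt(va / len(a) + vb / len(b)) or 1e-12
    return (ma - mb) / denom


def select_genes(X, y, k):
    """Assumed selection rule: keep the k genes with the largest |Welch t|."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == -1]
    scores = [abs(welch_t([p[g] for p in pos], [n[g] for n in neg]))
              for g in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda g: -scores[g])[:k]


def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200, seed=0):
    """Linear soft-margin SVM via stochastic subgradient descent (hinge loss)."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [(1 - lr * lam) * wj for wj in w]  # L2 shrinkage
            if margin < 1:  # sample violates the margin
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def train_cascade(X, y, class_order, k=2):
    """Train N-1 blocks; block i separates class_order[i] from later classes."""
    blocks = []
    for i, target in enumerate(class_order[:-1]):
        remaining = set(class_order[i:])
        rows = [(x, 1 if t == target else -1)
                for x, t in zip(X, y) if t in remaining]
        Xb, yb = [r[0] for r in rows], [r[1] for r in rows]
        genes = select_genes(Xb, yb, k)  # block-specific gene space
        w, b = train_linear_svm([[x[g] for g in genes] for x in Xb], yb)
        blocks.append((target, genes, w, b))
    return blocks


def predict_cascade(blocks, x, last_class):
    """Pass x through the blocks in order; the first positive block decides."""
    for target, genes, w, b in blocks:
        if sum(wj * x[g] for wj, g in zip(w, genes)) + b > 0:
            return target
    return last_class


def loocv_accuracy(X, y, class_order, k=2):
    """Leave-one-out CV with gene selection redone inside every fold."""
    hits = 0
    for i in range(len(X)):
        Xtr, ytr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        blocks = train_cascade(Xtr, ytr, class_order, k)
        hits += predict_cascade(blocks, X[i], class_order[-1]) == y[i]
    return hits / len(X)


def make_toy_data(n_per_class=8, seed=0):
    """Hypothetical 3-class, 4-gene data; gene 3 is pure noise."""
    rng = random.Random(seed)
    centers = {"A": [4, 0, 0, 0], "B": [0, 4, 0, 0], "C": [0, 0, 4, 0]}
    X, y = [], []
    for label, center in centers.items():
        for _ in range(n_per_class):
            X.append([v + rng.gauss(0, 0.3) for v in center])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print("LOO accuracy:", loocv_accuracy(X, y, ["A", "B", "C"]))
```

Because `train_cascade` is called inside each fold of `loocv_accuracy`, the held-out sample never influences which genes are selected, which is the selection bias the evaluation protocol is meant to avoid.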
