Evaluating Methods for Classifying Expression Data

Abstract An attractive application of expression technologies is predicting drug efficacy or safety from the expression data of biomarkers. To evaluate the performance of various classification methods for building predictive models, we applied these methods to six expression datasets. These datasets came from studies using microarray technologies and had two or more classes. From each of the original datasets, two subsets were generated to simulate two scenarios in biomarker applications. First, a 50-gene subset was used to simulate a candidate-gene approach, for cases where it might not be practical to measure a large number of genes/biomarkers. Second, a 2000-gene subset was used to simulate a whole-genome approach. We evaluated the relative performance of several classification methods using leave-one-out cross-validation and bootstrap cross-validation. Although all methods perform well on both subsets of a relatively easy two-class dataset, performance differs among methods on the other datasets. Overall, partial least squares discriminant analysis (PLS-DA) and support vector machines (SVM) outperform all other methods. We suggest a practical approach that takes advantage of multiple methods in biomarker applications.
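The two evaluation schemes named in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the study: the nearest-centroid classifier stands in for the methods actually compared (such as PLS-DA and SVM), the function names are hypothetical, the toy data are invented, and the bootstrap estimate shown is a simple out-of-bag variant, since the abstract does not spell out the exact bootstrap cross-validation procedure.

```python
import random


def fit_centroids(X, y):
    """Nearest-centroid stand-in classifier: return {label: mean vector}."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def predict(centroids, row):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def sqdist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda lab: sqdist(centroids[lab]))


def loocv_error(X, y):
    """Leave-one-out CV: hold out each sample once, train on the rest."""
    errors = 0
    for i in range(len(X)):
        model = fit_centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        if predict(model, X[i]) != y[i]:
            errors += 1
    return errors / len(X)


def bootstrap_oob_error(X, y, n_rounds=50, seed=0):
    """Out-of-bag bootstrap (assumed variant): train on a resample drawn
    with replacement, test on the samples that were not drawn."""
    rng = random.Random(seed)
    n = len(X)
    rates = []
    for _ in range(n_rounds):
        drawn = [rng.randrange(n) for _ in range(n)]
        drawn_set = set(drawn)
        held_out = [i for i in range(n) if i not in drawn_set]
        if not held_out:
            continue  # every sample was drawn; nothing left to test on
        model = fit_centroids([X[i] for i in drawn], [y[i] for i in drawn])
        wrong = sum(predict(model, X[i]) != y[i] for i in held_out)
        rates.append(wrong / len(held_out))
    return sum(rates) / len(rates) if rates else float("nan")


if __name__ == "__main__":
    # Invented toy "expression" data: six samples, two genes, two classes.
    X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
         [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
    y = ["A", "A", "A", "B", "B", "B"]
    print("LOOCV error:", loocv_error(X, y))
    print("Bootstrap out-of-bag error:", bootstrap_oob_error(X, y))
```

To compare several methods in this way, each one would be plugged in behind the same fit and predict interface and scored on identical splits, so that differences in error reflect the method rather than the resampling.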
