Multiclass cancer classification and biomarker discovery using GA-based algorithms

MOTIVATION: The development of microarray-based high-throughput gene profiling has raised the hope that this technology could provide an efficient and accurate means of diagnosing and classifying tumors, as well as predicting prognoses and effective treatments. However, the large amount of data generated by microarrays must be reduced effectively to reliable sets of discriminant gene features, or tumor biomarkers, before such multiclass tumor discrimination is practical. The availability of reliable sets of biomarkers, especially serum biomarkers, should have a major impact on our understanding and treatment of cancer.

RESULTS: We have combined genetic algorithm (GA) and all-paired (AP) support vector machine (SVM) methods for multiclass cancer categorization. Predictive features can be determined automatically through iterative GA/SVM, leading to very compact sets of non-redundant cancer-relevant genes with the best classification performance reported to date. Interestingly, these different classifier sets share only a modest number of gene features, yet they reach similar levels of accuracy in leave-one-out cross-validation (LOOCV). Further characterization of these optimal tumor-discriminant features, including the use of nearest shrunken centroids (NSC), analysis of annotations and literature text mining, reveals previously unappreciated tumor subclasses and a series of genes that could be used as cancer biomarkers. With this approach, we believe that microarray-based multiclass molecular analysis can be an effective tool for cancer biomarker discovery and subsequent molecular cancer diagnosis.
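The general shape of a GA wrapper around an all-paired multiclass classifier scored by LOOCV can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract gives no GA settings, so the population size, truncation selection, union-based crossover, mutation rate and fixed subset size are all hypothetical choices, as are the function names and the synthetic data. To keep the sketch dependency-free, each pairwise binary classifier is a nearest-centroid rule standing in for the SVM used in the paper.

```python
import random
from itertools import combinations

def centroid(rows, genes):
    """Mean expression of the selected genes over a list of samples."""
    return [sum(r[g] for r in rows) / len(rows) for g in genes]

def predict_all_pairs(train_X, train_y, x, genes):
    """All-pairs voting: one binary classifier per class pair, most votes wins.
    Each binary classifier is a nearest-centroid rule (a stand-in for an SVM)."""
    by_class = {}
    for row, label in zip(train_X, train_y):
        by_class.setdefault(label, []).append(row)
    cents = {c: centroid(rows, genes) for c, rows in by_class.items()}
    votes = {c: 0 for c in cents}
    point = [x[g] for g in genes]
    for a, b in combinations(sorted(cents), 2):
        da = sum((p - q) ** 2 for p, q in zip(point, cents[a]))
        db = sum((p - q) ** 2 for p, q in zip(point, cents[b]))
        votes[a if da <= db else b] += 1
    return max(sorted(votes), key=votes.get)  # ties go to the first label

def loocv_accuracy(X, y, genes):
    """Leave-one-out cross-validation accuracy for one gene subset."""
    hits = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += predict_all_pairs(train_X, train_y, X[i], genes) == y[i]
    return hits / len(X)

def ga_select(X, y, n_select=3, pop_size=20, generations=15, mut_rate=0.2, seed=0):
    """Search for a gene subset of size n_select with high LOOCV accuracy.
    All hyperparameters here are illustrative assumptions."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    cache = {}

    def fitness(genes):
        if genes not in cache:
            cache[genes] = loocv_accuracy(X, y, genes)
        return cache[genes]

    pop = [tuple(sorted(rng.sample(range(n_genes), n_select)))
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[:max(2, pop_size // 2)]  # truncation selection
        children = [scored[0]]                    # elitism: keep the best subset
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            pool = sorted(set(p1) | set(p2))      # crossover: draw from the union
            child = rng.sample(pool, n_select)
            if rng.random() < mut_rate:           # mutation: swap in an outside gene
                outside = [g for g in range(n_genes) if g not in child]
                child[rng.randrange(n_select)] = rng.choice(outside)
            children.append(tuple(sorted(child)))
        pop = children
    best = max(pop, key=fitness)
    return best, fitness(best)

def make_toy_data(seed=1, n_per_class=6, n_genes=12):
    """Synthetic 3-class data: each class is shifted on one gene, the rest is noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label, shift_gene in (("A", 0), ("B", 1), ("C", 2)):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            row[shift_gene] += 4.0
            X.append(row)
            y.append(label)
    return X, y

if __name__ == "__main__":
    X, y = make_toy_data()
    genes, acc = ga_select(X, y)
    print("selected genes:", genes, "LOOCV accuracy:", acc)
```

Running the search with different seeds is one simple way to observe the behavior the abstract describes, where distinct gene subsets can reach similar LOOCV accuracy. Note that scoring subsets by LOOCV on the same samples used for selection tends to give optimistic estimates, so an independent test set would be needed for an unbiased figure.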
