Classification error as a measure of gene relevance in cancer diagnosis

One of the main problems in cancer diagnosis using DNA microarray data is selecting the genes relevant to the pathology by analyzing their expression profiles in tissues under two different phenotypic conditions. The question we pose is the following: how do we measure the relevance of a single gene to a given pathology? A gene is relevant to a particular disease if the occurrence of the pathology in new patients can be correctly predicted on the basis of the expression level of that gene alone. In other words, a gene is informative for the disease if its expression levels are useful for training a classifier that generalizes, that is, one that correctly predicts the status of new patients. In this paper we present a statistically well-founded method, free of selection bias, for finding relevant genes on the basis of their classification ability. We applied the method to a colon cancer data set and produced a list of relevant genes, ranked by their prediction accuracy. Out of more than 6500 available genes, we found 54 overexpressed in normal tissue and 77 overexpressed in tumor tissue with prediction accuracy greater than 70% and p-value p ≤ 0.05.
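
The idea of scoring one gene by how well it predicts the status of unseen patients can be illustrated with a minimal sketch. The abstract does not specify the classifier, the error estimator, or the significance test, so everything below is an assumption chosen for brevity: a nearest-class-mean rule on a single expression value, leave-one-out estimation of accuracy, and a label-permutation test for the p-value. The function names and the toy expression values are hypothetical and are not taken from the paper or its data.

```python
import random
from statistics import mean


def nearest_mean_predict(train_x, train_y, x):
    """Predict the class (0 or 1) whose mean expression is closer to x."""
    m0 = mean(v for v, label in zip(train_x, train_y) if label == 0)
    m1 = mean(v for v, label in zip(train_x, train_y) if label == 1)
    return 0 if abs(x - m0) <= abs(x - m1) else 1


def loo_accuracy(x, y):
    """Leave-one-out accuracy of the single-gene classifier.

    Each sample is predicted by a rule trained on the remaining samples,
    so the accuracy is always measured on data not used for training.
    Assumes every class has at least two samples.
    """
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_mean_predict(train_x, train_y, x[i]) == y[i]:
            correct += 1
    return correct / len(x)


def permutation_p_value(x, y, n_perm=200, seed=0):
    """Return (observed accuracy, p-value) under random relabelling.

    The p-value is the fraction of label permutations whose leave-one-out
    accuracy is at least the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = loo_accuracy(x, y)
    labels = list(y)
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if loo_accuracy(x, labels) >= observed:
            at_least_as_good += 1
    return observed, (at_least_as_good + 1) / (n_perm + 1)


# Hypothetical expression levels of one gene: 0 = normal, 1 = tumor.
expression = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
status = [0, 0, 0, 0, 1, 1, 1, 1]

accuracy, p_value = permutation_p_value(expression, status)
print(f"accuracy = {accuracy:.2f}, p-value = {p_value:.3f}")
```

Under these assumptions, repeating the computation for every gene and keeping those whose accuracy and p-value pass chosen cutoffs yields a ranked list of the kind described above. With thousands of genes, a correction for multiple testing would also need to be considered.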
