On the statistical assessment of classifiers using DNA microarray data

Background

In this paper we present a method for the statistical assessment of cancer predictors that make use of gene expression profiles. The methodology is applied to a new data set of microarray gene expression data collected at Casa Sollievo della Sofferenza Hospital, Foggia, Italy. The data set is made up of normal (22) and tumor (25) specimens extracted from 25 patients affected by colon cancer. We address several questions relevant to the automatic diagnosis of cancer: Is the size of the available data set sufficient to build accurate classifiers? What is the statistical significance of the associated error rates? In what ways does accuracy depend on the adopted classification scheme? How many genes are correlated with the pathology, and how many are sufficient for an accurate colon cancer classification? The method we propose answers these questions while avoiding the potential pitfalls hidden in the analysis and interpretation of microarray data.

Results

We estimate the generalization error, evaluated through the Leave-K-Out Cross Validation error, for three different classification schemes by varying the number of training examples and the number of genes used. The statistical significance of the error rate is measured with a permutation test. We also provide a statistical analysis of the frequencies of the genes involved in the classification. Using the whole set of genes, we found that the Weighted Voting Algorithm (WVA) classifier learns the distinction between normal and tumor specimens with 25 training examples, with an error rate of e = 21% (p = 0.045). This error rate remains constant even when the number of examples increases. Regularized Least Squares (RLS) and Support Vector Machines (SVM) classifiers can learn with only 15 training examples, with error rates of e = 19% (p = 0.035) and e = 18% (p = 0.037) respectively. For these two classifiers the error rate decreases as the training set size increases, reaching its best value with 35 training examples: e = 14% (p = 0.027) for RLS and e = 11% (p = 0.019) for SVM. Concerning the number of genes, the signal-to-noise statistic identifies about 6000 genes (p < 0.05) correlated with the pathology. The performance of the RLS and SVM classifiers does not change when 74% of the genes are used. It then degrades progressively, with the error rate reaching e = 16% (p < 0.05) when only 2 genes are employed. The biological relevance of a set of genes determined by our statistical analysis, and the major roles they play in colorectal tumorigenesis, are discussed.

Conclusions

The proposed method provides statistically significant answers to precise questions relevant for the diagnosis and prognosis of cancer. We found that, with as few as 15 examples, it is possible to train statistically significant classifiers for colon cancer diagnosis. As for the number of genes sufficient for a reliable classification of colon cancer, our results suggest that it depends on the accuracy required.
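The assessment procedure summarized in the Results can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes random train/test splits as the Leave-K-Out scheme, the commonly cited form of the signal-to-noise statistic, (mean of class 1 minus mean of class 2) divided by (sum of the two class standard deviations), and a simplified weighted-voting rule built on that statistic. All function names, the number of splits, the number of permutations, and the synthetic data are hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np

def signal_to_noise(X, y):
    """Per-gene statistic: (mean_pos - mean_neg) / (std_pos + std_neg)."""
    pos, neg = X[y == 1], X[y == -1]
    eps = 1e-12  # assumption of this sketch: guards against zero variance
    return (pos.mean(axis=0) - neg.mean(axis=0)) / (pos.std(axis=0) + neg.std(axis=0) + eps)

def weighted_vote_predict(X_train, y_train, X_test):
    """Simplified weighted voting: each gene votes with weight s2n,
    in proportion to the distance of its value from the class midpoint."""
    s2n = signal_to_noise(X_train, y_train)
    midpoint = 0.5 * (X_train[y_train == 1].mean(axis=0)
                      + X_train[y_train == -1].mean(axis=0))
    votes = (X_test - midpoint) @ s2n
    return np.where(votes >= 0, 1, -1)

def leave_k_out_error(X, y, n_train, n_splits, rng):
    """Mean test error over random splits with n_train training examples;
    the remaining K = n - n_train examples are left out for testing."""
    errors = []
    while len(errors) < n_splits:
        order = rng.permutation(len(y))
        train, test = order[:n_train], order[n_train:]
        if len(np.unique(y[train])) < 2:
            continue  # redraw splits that miss one of the two classes
        pred = weighted_vote_predict(X[train], y[train], X[test])
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))

def permutation_p_value(X, y, n_train, n_splits, n_perm, rng):
    """Fraction of label permutations whose error is at most the observed one."""
    observed = leave_k_out_error(X, y, n_train, n_splits, rng)
    null = [leave_k_out_error(X, rng.permutation(y), n_train, n_splits, rng)
            for _ in range(n_perm)]
    p = (1 + sum(e <= observed for e in null)) / (1 + n_perm)
    return observed, p

# Synthetic stand-in data (NOT the paper's data): 22 "normal" and 25 "tumor"
# samples, 200 genes, of which the first 20 are shifted in the tumor class.
rng = np.random.default_rng(0)
y = np.array([-1] * 22 + [1] * 25)
X = rng.normal(size=(len(y), 200))
X[y == 1, :20] += 1.5

err, p = permutation_p_value(X, y, n_train=15, n_splits=50, n_perm=99, rng=rng)
print(f"error with 15 training examples: {err:.3f}, permutation p-value: {p:.3f}")
```

Repeating the last call for several values of `n_train` gives a learning curve of the kind the Results describe, and restricting `X` to the genes with the largest absolute signal-to-noise values gives the corresponding curve over the number of genes. Note that in a careful analysis the gene ranking would be recomputed inside each training split to avoid selection bias.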
