Using the information embedded in the testing sample to break the limits caused by the small sample size in microarray-based classification

Background: Microarray-based tumor classification is characterized by a very large number of features (genes) and a small number of samples. In such cases, statistical techniques cannot reliably determine which genes are correlated with each tumor type. A popular solution is to use a pre-specified subset of genes. However, molecular variations are generally correlated with a large number of genes, and a gene that shows no correlation with a disease on its own may become informative in combination with other genes.

Results: In this paper, we propose a new classification strategy that can reduce the effect of over-fitting without the need to pre-select a small subset of genes. Our solution works by taking advantage of the information embedded in the testing samples. We note that a well-defined classification algorithm works best when the data is properly labeled. Hence, our classification algorithm will discriminate all samples best when the testing sample is assumed to belong to its correct class. We compare our solution with several well-known alternatives for tumor classification on a variety of publicly available datasets. Our approach consistently leads to better classification results.

Conclusion: Studies indicate that thousands of samples may be required to extract useful statistical information from microarray data. Herein, it is shown that this problem can be circumvented by using the information embedded in the testing samples.
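The Results paragraph describes the core idea: hypothesize, in turn, each possible class label for the testing sample, include the sample in the labeled data under that hypothesis, and keep the label for which all samples are best discriminated. The Python sketch below illustrates one way this could work; the function names, the exhaustive scan over candidate labels, and the use of the classical Fisher criterion trace(S_w^+ S_b) as the discrimination score are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumptions noted above): score each candidate label for the
# testing sample by how well the augmented, labeled data are discriminated,
# and return the best-scoring label. Not the paper's actual algorithm.
import numpy as np


def fisher_criterion(X, y):
    """Fisher discrimination score trace(pinv(S_w) @ S_b) for data X (n x p) and labels y."""
    overall_mean = X.mean(axis=0)
    p = X.shape[1]
    S_w = np.zeros((p, p))  # within-class scatter
    S_b = np.zeros((p, p))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        diff = Xc - mean_c
        S_w += diff.T @ diff
        md = (mean_c - overall_mean)[:, None]
        S_b += Xc.shape[0] * (md @ md.T)
    # Pseudo-inverse because S_w is singular when n_samples << n_features.
    return np.trace(np.linalg.pinv(S_w) @ S_b)


def classify_with_test_sample(X_train, y_train, x_test):
    """Assign x_test the label under which all labeled samples are best discriminated."""
    best_label, best_score = None, -np.inf
    for c in np.unique(y_train):
        X_aug = np.vstack([X_train, x_test])   # add the testing sample ...
        y_aug = np.append(y_train, c)          # ... under the hypothesized label c
        score = fisher_criterion(X_aug, y_aug)
        if score > best_score:
            best_label, best_score = c, score
    return best_label
```

In practice one would call classify_with_test_sample once per testing sample; for genome-scale feature counts, the p-by-p scatter matrices would normally be formed in the span of the training data (e.g., after an SVD) rather than explicitly, but the sketch keeps the direct formulation for clarity.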
