Effects of replacing the unreliable cDNA microarray measurements on the disease classification based on gene expression profiles and functional modules

MOTIVATION: Microarray datasets frequently contain a large number of missing values (MVs), which need to be estimated and replaced before subsequent data mining. This paper studies how different MV treatments for cDNA microarray data affect disease classification analysis.

RESULTS: By analyzing five datasets, we demonstrate that, among the three kinds of classifiers evaluated in this study, support vector machine (SVM) classifiers are robust to varied MV imputation methods [e.g. replacing MVs by zero, the K nearest-neighbor (KNN) imputation algorithm, local least squares imputation and Bayesian principal component analysis], while the classification and regression tree classifiers are sensitive to them in terms of classification accuracy. The KNN classifiers built on differentially expressed genes (DEGs) are robust to the varied MV treatments, but the performance of KNN classifiers based on all measured genes can deteriorate significantly when MVs are imputed for genes with a large missing rate (MR) (e.g. MR > 5%). In general, while replacing MVs by zero performs relatively poorly, the other imputation algorithms differ little in their effect on the classification performance of the SVM or KNN classifiers. We further demonstrate the power and feasibility of our recently proposed functional expression profile (FEP) approach as a means to handle microarray data with MVs. The FEPs, which are derived from functional modules that are enriched with sets of DEGs and can thus be consistently identified under varied MV treatments, achieve precise disease classification with better biological interpretation. We conclude that the choice of MV treatment should be determined in the context of the approaches subsequently used for disease classification. The suggested exclusion criterion of ignoring genes with a large MR (e.g. >5%), while justifiable for some classifiers such as KNN classifiers, should not be taken as a general rule for all classifiers.
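To make the MV treatments named in the abstract concrete, the following is a minimal sketch in Python with NumPy, not the authors' implementation. It assumes a genes-by-samples matrix with MVs encoded as NaN. All function names are hypothetical, and the KNN imputation shown is a simplified, unweighted variant (mean of the k nearest genes by root-mean-square distance over shared observed samples), which may differ from the algorithm evaluated in the paper.

```python
import numpy as np

def missing_rate(X):
    """Fraction of missing (NaN) entries per gene (row)."""
    return np.isnan(X).mean(axis=1)

def filter_by_missing_rate(X, max_mr=0.05):
    """Exclusion criterion: keep genes whose MR does not exceed max_mr."""
    keep = missing_rate(X) <= max_mr
    return X[keep], keep

def impute_zero(X):
    """Replace every MV by zero."""
    return np.where(np.isnan(X), 0.0, X)

def impute_knn(X, k=2):
    """Simplified KNN imputation (assumption: unweighted neighbor mean).

    For each gene with MVs, find the k closest genes that are observed
    in all samples where the target gene is missing, then fill each MV
    with the mean of those neighbors in that sample.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        miss = np.isnan(X[i])
        dists = []
        for j in range(X.shape[0]):
            # A neighbor must supply values for every missing sample.
            if j == i or np.isnan(X[j, miss]).any():
                continue
            shared = ~miss & ~np.isnan(X[j])
            if not shared.any():
                continue
            d = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
            dists.append((d, j))
        if not dists:
            continue  # no usable neighbor; leave the MVs in place
        dists.sort()
        nearest = [j for _, j in dists[:k]]
        out[i, miss] = X[nearest][:, miss].mean(axis=0)
    return out
```

On a small matrix, `filter_by_missing_rate` drops any gene with an MR above the threshold, whereas `impute_zero` and `impute_knn` keep every gene and differ only in the values they fill in. That contrast between excluding and imputing is what the classification comparison in the abstract examines.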
