Factors affecting the accuracy of a class prediction model in gene expression data

Background
Class prediction models have been shown to perform inconsistently in clinical gene expression datasets. Previous evaluation studies, mostly done in the field of cancer, showed that the accuracy of class prediction models differs from dataset to dataset and depends on the type of classification function. While much is known about the characteristics of classification functions, little has been done to determine which characteristics of gene expression data affect the performance of a classifier. This study aims to empirically identify data characteristics that affect the predictive accuracy of classification models, outside of the field of cancer.

Results
Datasets from twenty-five studies meeting predefined inclusion and exclusion criteria were downloaded. Nine classification functions were chosen, falling within the following categories: discriminant analyses or Bayes classifiers, tree-based methods, regularization and shrinkage methods, and nearest neighbors methods. Nine class prediction models were then built for each dataset using the same procedure, and their performances were evaluated by calculating their accuracies. The characteristics of each experiment were recorded (i.e., observed disease, medical question, tissue/cell types and sample size), together with characteristics of the gene expression data, namely the number of differentially expressed genes, the fold changes and the within-class correlations. Their effects on the accuracy of a class prediction model were statistically assessed by random effects logistic regression. The number of differentially expressed genes and the average fold change had a significant impact on the accuracy of a classification model, individually explaining up to 72% and 57% of the variation in prediction accuracy, respectively. Multivariable random effects logistic regression with forward selection yielded the two aforementioned factors and the within-class correlation as factors affecting the accuracy of classification functions, explaining 91.5% of the between-study variation.

Conclusions
We evaluated study- and data-related factors that might explain the varying performances of classification functions in non-cancerous datasets. Our results showed that the number of differentially expressed genes, the fold change, and the correlation in gene expression data significantly affect the accuracy of class prediction models.
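To make two of the data characteristics named above concrete, the following is a minimal sketch of how an average fold change and a within-class correlation could be computed for a two-class dataset. It is an illustration only, not the procedure used in the study: the function names, the toy expression values, the use of the mean absolute difference in log2 expression as the "average fold change", and the use of the mean pairwise Pearson correlation between genes as the "within-class correlation" are all assumptions made for this example.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_abs_log2_fold_change(expr, labels):
    """Average over genes of |mean(class A) - mean(class B)|.

    expr maps a gene name to its log2 expression values, one per sample
    (hypothetical layout); labels gives the class of each sample.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("exactly two classes are expected")
    a, b = classes
    diffs = []
    for values in expr.values():
        mean_a = mean(v for v, lab in zip(values, labels) if lab == a)
        mean_b = mean(v for v, lab in zip(values, labels) if lab == b)
        diffs.append(abs(mean_a - mean_b))
    return mean(diffs)


def mean_within_class_correlation(expr, labels):
    """Mean pairwise gene-gene Pearson correlation, averaged over classes."""
    per_class = []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        # Restrict every gene to the samples of the current class.
        sub = [[values[i] for i in idx] for values in expr.values()]
        cors = [pearson(g1, g2) for g1, g2 in combinations(sub, 2)]
        per_class.append(mean(cors))
    return mean(per_class)


# Toy data (invented for illustration): three genes, six samples.
labels = ["case", "case", "case", "control", "control", "control"]
expr = {
    "geneA": [8.0, 9.0, 10.0, 5.0, 6.0, 7.0],
    "geneB": [4.0, 5.0, 6.0, 4.0, 5.0, 6.0],
    "geneC": [7.0, 6.0, 5.0, 3.0, 2.0, 1.0],
}

avg_fc = mean_abs_log2_fold_change(expr, labels)
within_cor = mean_within_class_correlation(expr, labels)
print(f"average |log2 fold change|: {avg_fc:.3f}")
print(f"mean within-class correlation: {within_cor:.3f}")
```

In a real analysis the expression matrix would contain thousands of genes, so a shrinkage-based correlation estimate and a moderated test for differential expression would likely be more appropriate than these plain summaries.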
