Comparison of statistical methods and the use of quality control samples for batch effect correction in human transcriptome data

Batch effects are technical sources of variation that arise when gene expression analyses must be conducted on different dates, as is unavoidable with the large number of biological samples in population-based studies. The aim of this study is to evaluate the performance of linear mixed models (LMM) and ComBat in batch effect removal. We also assessed the utility of including quality control samples in the study design as technical replicates. To do so, we simulated gene expression data by adding “treatment” and batch effects to a real gene expression dataset. The performance of LMM and ComBat, with and without quality control samples, was assessed in terms of sensitivity and specificity while correcting for the batch effect across a wide range of effect sizes, levels of statistical noise, sample sizes and degrees of balance in the design. The simulations showed small differences between LMM and ComBat. LMM identifies stronger relationships between big effect sizes and gene expression than ComBat, whereas ComBat generally identifies more true and false positives than LMM. These small differences can nevertheless be relevant depending on the research goal. With either method, quality control samples did not reduce the batch effect, showing no added value for including them in the study design.
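The simulation design described above (known "treatment" and batch effects added to expression data, then scored by sensitivity and specificity) can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes synthetic Gaussian noise in place of the real dataset, uses per-batch mean-centering as a simple stand-in for LMM or ComBat, and calls a gene significant when its Welch t-statistic exceeds an arbitrary cutoff. All function names and parameter values are hypothetical.

```python
import numpy as np

def simulate_expression(n_genes=500, n_samples=60, n_batches=3, n_true=50,
                        effect=1.0, batch_sd=1.0, noise_sd=0.5, seed=0):
    """Simulate a genes x samples matrix with known treatment and batch effects."""
    rng = np.random.default_rng(seed)
    # Assumption: Gaussian noise stands in for the real expression dataset.
    expr = rng.normal(0.0, noise_sd, size=(n_genes, n_samples))
    # Balanced design: treatment alternates, so every batch is half treated.
    treatment = np.tile([0, 1], n_samples // 2)
    batch = np.repeat(np.arange(n_batches), n_samples // n_batches)
    # Only the first n_true genes truly respond to the treatment.
    is_true = np.zeros(n_genes, dtype=bool)
    is_true[:n_true] = True
    expr += np.outer(is_true * effect, treatment)
    # Each gene receives its own additive shift in each batch.
    batch_shift = rng.normal(0.0, batch_sd, size=(n_genes, n_batches))
    expr += batch_shift[:, batch]
    return expr, treatment, batch, is_true

def center_by_batch(expr, batch):
    """Subtract each gene's per-batch mean (a stand-in, not LMM or ComBat)."""
    corrected = expr.copy()
    for b in np.unique(batch):
        cols = batch == b
        corrected[:, cols] -= corrected[:, cols].mean(axis=1, keepdims=True)
    return corrected

def welch_t(expr, treatment):
    """Per-gene Welch t-statistic comparing treated with untreated samples."""
    a = expr[:, treatment == 1]
    b = expr[:, treatment == 0]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / se

def sensitivity_specificity(expr, treatment, is_true, cutoff=3.5):
    """Call genes with |t| above an arbitrary cutoff and score the calls."""
    called = np.abs(welch_t(expr, treatment)) > cutoff
    sensitivity = np.sum(called & is_true) / np.sum(is_true)
    specificity = np.sum(~called & ~is_true) / np.sum(~is_true)
    return sensitivity, specificity

expr, treatment, batch, is_true = simulate_expression()
corrected = center_by_batch(expr, batch)
print("uncorrected:", sensitivity_specificity(expr, treatment, is_true))
print("corrected:  ", sensitivity_specificity(corrected, treatment, is_true))
```

Because this sketch uses a perfectly balanced design, mean-centering leaves the treated-versus-untreated mean difference unchanged and only removes between-batch variance. In an unbalanced design, simple centering would also remove part of the treatment signal, which is one reason model-based corrections such as LMM and ComBat are compared in the study.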
