Combining location-and-scale batch effect adjustment with data cleaning by latent factor adjustment

Background: In the context of high-throughput molecular data analysis, it is common that the observations included in a dataset form distinct groups; for example, measured at different times, under different conditions or even in different labs. These groups are generally denoted as batches. Systematic differences between these batches not attributable to the biological signal of interest are denoted as batch effects. If ignored when conducting analyses on the combined data, batch effects can lead to distortions in the results. In this paper we present FAbatch, a general, model-based method for correcting for such batch effects in the case of an analysis involving a binary target variable. It is a combination of two commonly used approaches: location-and-scale adjustment and data cleaning by adjustment for distortions due to latent factors. We compare FAbatch extensively to the most commonly applied competitors on the basis of several performance metrics. FAbatch can also be used in the context of prediction modelling to eliminate batch effects from new test data. This important application is illustrated using real and simulated data. We implemented FAbatch and various other functionalities in the R package bapred, available online from CRAN.

Results: FAbatch is seen to be competitive in many cases and above average in others. In our analyses, the only cases where it failed to adequately preserve the biological signal were when there were extremely outlying batches and when the batch effects were very weak compared to the biological signal.

Conclusions: As seen in this paper, batch effect structures found in real datasets are diverse. Current batch effect adjustment methods are often either too simplistic or make restrictive assumptions, which can be violated in real datasets. Due to the generality of its underlying model and its ability to perform well, FAbatch represents a reliable tool for batch effect adjustment for most situations found in practice.
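To make the location-and-scale component concrete, the following is a minimal sketch of a plain per-batch location-and-scale adjustment in Python with NumPy. It is not FAbatch and not the bapred implementation: it omits the latent factor step and does not protect the binary target variable. The function name `location_scale_adjust`, the choice of the grand mean and the pooled within-batch standard deviation as the common reference, and the simulated data are all assumptions made for illustration. Each batch is assumed to contain at least two observations.

```python
import numpy as np


def location_scale_adjust(x, batch):
    """Hypothetical helper: give every batch the same per-variable mean and sd.

    x     : (n_observations, n_variables) array of measurements
    batch : length-n_observations array of batch labels
    """
    x = np.asarray(x, dtype=float)
    batch = np.asarray(batch)
    labels = np.unique(batch)

    # Common reference: grand mean and pooled within-batch standard deviation.
    grand_mean = x.mean(axis=0)
    within_ss = np.zeros(x.shape[1])
    for b in labels:
        xb = x[batch == b]
        within_ss += ((xb - xb.mean(axis=0)) ** 2).sum(axis=0)
    pooled_sd = np.sqrt(within_ss / (x.shape[0] - labels.size))

    adjusted = np.empty_like(x)
    for b in labels:
        rows = batch == b
        mean_b = x[rows].mean(axis=0)
        sd_b = x[rows].std(axis=0, ddof=1)
        # Leave constant variables unscaled instead of dividing by zero.
        sd_b = np.where(sd_b > 0, sd_b, 1.0)
        adjusted[rows] = (x[rows] - mean_b) / sd_b * pooled_sd + grand_mean
    return adjusted


# Simulated example (assumed data): two batches that differ in location and scale.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0.0, 1.0, (20, 5)), rng.normal(3.0, 2.0, (30, 5))])
batch = np.array([0] * 20 + [1] * 30)
adjusted = location_scale_adjust(x, batch)
```

After the adjustment, every variable has the same mean and standard deviation in each batch. Because this sketch ignores the target variable, it would also remove class differences that happen to be confounded with batch, which is one reason a model-based method may be preferable when class proportions differ across batches.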
