Why Batch Effects Matter in Omics Data, and How to Avoid Them

Effective integration and analysis of new high-throughput data, especially gene-expression and proteomic-profiling data, are expected to deliver novel clinical insights and therapeutic options. Unfortunately, technical heterogeneity, or batch effects (arising from differences in experiment times, handlers, reagent lots, etc.), has proven challenging to handle. Although batch effect-correction algorithms (BECAs) exist, we know little about effective batch-effect mitigation: even now, new batch effect-associated problems are emerging. These include false effects caused by misapplying BECAs and positive bias during model evaluation. Depending on the choice of algorithm and experimental set-up, biological heterogeneity can be mistaken for batch effects and wrongly removed. Here, we examine these emerging batch effect-associated problems, propose a series of best practices, and discuss some of the challenges that lie ahead.
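
The risk that biological heterogeneity is mistaken for a batch effect can be illustrated with a minimal, noise-free sketch. It is not one of the BECAs discussed here: per-batch mean-centering is used only as the simplest stand-in for a correction step, and the function names, effect sizes, and sample layouts are hypothetical choices made for illustration.

```python
from statistics import mean

def simulate(classes, batches, class_effect=2.0, batch_effect=5.0):
    """Toy expression values for one gene: class effect plus an offset for batch B.
    The effect sizes are arbitrary illustrative assumptions."""
    return [class_effect * c + batch_effect * (b == "B")
            for c, b in zip(classes, batches)]

def center_by_batch(values, batches):
    """Naive batch correction: subtract each batch's own mean from its samples."""
    batch_means = {b: mean(v for v, bb in zip(values, batches) if bb == b)
                   for b in set(batches)}
    return [v - batch_means[b] for v, b in zip(values, batches)]

def class_difference(values, classes):
    """Mean of class 1 minus mean of class 0."""
    ones = [v for v, c in zip(values, classes) if c == 1]
    zeros = [v for v, c in zip(values, classes) if c == 0]
    return mean(ones) - mean(zeros)

batches = ["A"] * 4 + ["B"] * 4

# Balanced design: both classes appear equally in each batch.
balanced_classes = [0, 0, 1, 1, 0, 0, 1, 1]
# Confounded design: class and batch coincide exactly.
confounded_classes = [0, 0, 0, 0, 1, 1, 1, 1]

results = {}
for name, classes in [("balanced", balanced_classes),
                      ("confounded", confounded_classes)]:
    raw = simulate(classes, batches)
    corrected = center_by_batch(raw, batches)
    results[name] = (class_difference(raw, classes),
                     class_difference(corrected, classes))
    print(name, "raw:", results[name][0], "corrected:", results[name][1])
```

In the balanced layout the class difference survives centering, whereas in the fully confounded layout the raw difference is inflated by the batch offset and the corrected difference vanishes, because the correction cannot separate class from batch.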
