Diagnostics and correction of batch effects in large‐scale proteomic studies: a tutorial

Advancements in mass spectrometry‐based proteomics have enabled experiments encompassing hundreds of samples. While these large sample sets deliver much‐needed statistical power, handling them introduces technical variability known as batch effects. Here, we present a step‐by‐step protocol for the assessment, normalization, and batch correction of proteomic data. We review established methodologies from related fields and describe solutions specific to proteomic challenges, such as ion intensity drift and missing values in quantitative feature matrices. Finally, we compile a set of techniques that enable control of batch effect adjustment quality. We provide an R package, "proBatch", containing the functions required for each step of the protocol. We demonstrate the utility of this methodology on five proteomic datasets, each encompassing hundreds of samples and together covering multiple experimental designs. In conclusion, we provide guidelines and tools to make the extraction of true biological signal from large proteomic studies more robust and transparent, ultimately facilitating reliable and reproducible research in clinical proteomics and systems biology.
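To make the normalization and batch correction steps concrete, the following is a minimal sketch in Python with NumPy. It is not the proBatch implementation and does not reproduce its API. The function names `median_normalize` and `center_per_batch`, the toy matrix, and the batch labels are all assumptions introduced for illustration. The sketch assumes log-transformed intensities arranged as features by samples, with missing values encoded as NaN and left unimputed.

```python
import numpy as np


def median_normalize(log_intensity):
    """Shift each sample (column) so its median equals the global median.

    NaN entries are ignored when computing medians and stay NaN in the output.
    """
    sample_medians = np.nanmedian(log_intensity, axis=0)
    global_median = np.nanmedian(log_intensity)
    return log_intensity - sample_medians + global_median


def center_per_batch(log_intensity, batches):
    """Align per-feature batch medians to the feature's overall median.

    Hypothetical helper: for every feature (row), subtract the median of the
    batch a sample belongs to and add back the median over all samples.
    Assumes every feature has at least one observed value in every batch.
    """
    batches = np.asarray(batches)
    corrected = log_intensity.copy()
    feature_medians = np.nanmedian(log_intensity, axis=1, keepdims=True)
    for batch in np.unique(batches):
        cols = batches == batch
        batch_medians = np.nanmedian(log_intensity[:, cols], axis=1, keepdims=True)
        corrected[:, cols] = log_intensity[:, cols] - batch_medians + feature_medians
    return corrected


# Toy data (made up): 3 features x 4 samples, two batches, one missing value.
toy = np.array([
    [10.0, 10.2, 11.0, 11.2],
    [12.0, np.nan, 12.1, 12.3],
    [8.0, 8.4, 7.6, 7.8],
])
toy_batches = ["A", "A", "B", "B"]

normalized = median_normalize(toy)
corrected = center_per_batch(normalized, toy_batches)
```

Median centering is only one of several possible adjustments and is chosen here because it tolerates missing values without imputation. A real analysis would also need the diagnostic and quality-control steps that the protocol describes before and after correction.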
