Improved batch correction in untargeted MS-based metabolomics

Abstract

Introduction: Batch effects in large untargeted metabolomics experiments are almost unavoidable, especially when sensitive detection techniques like mass spectrometry (MS) are employed. To obtain peak intensities that are comparable across all batches, corrections need to be performed. Since non-detects, i.e., signals with an intensity too low to be detected with certainty, are common in metabolomics studies, batch correction methods need to take these into account.

Objectives: This paper compares several batch correction methods and investigates the effect of different strategies for handling non-detects.

Methods: Batch correction methods usually consist of regression models, possibly also accounting for trends within batches. These models can be fitted on quality control samples (QCs) injected at regular intervals. Study samples can also be used, provided that the injection order is properly randomized. Normalization methods that do not use information on batch labels or injection order can correct for batch effects as well. Introducing two easy-to-use quality criteria, we assess the merits of these batch correction strategies using three large LC–MS and GC–MS data sets of samples from Arabidopsis thaliana.

Results: The three data sets have very different characteristics, leading to clearly distinct behaviour of the batch correction strategies studied. Explicit inclusion of information on batch and injection order in general leads to very good corrections; when enough QCs are available, general normalization approaches also perform well. Several approaches are shown to be able to handle non-detects; replacing them with very small numbers such as zero seems the worst of the approaches considered.

Conclusion: The use of quality control samples for batch correction leads to good results when enough QCs are available. If an experiment is properly set up, batch correction using the study samples usually leads to a similarly high-quality correction, but has the advantage that more metabolites are corrected. The strategy for handling non-detects is important: choosing small values like zero can lead to suboptimal batch corrections.
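
The regression-based correction described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes log-scale intensities for a single metabolite, a straight-line trend in injection order within each batch, and non-detects coded as NaN and simply left out of the fit (only one of several possible strategies). The function name `correct_batches`, its arguments, and the toy data are hypothetical.

```python
import numpy as np


def correct_batches(intensity, batch, order, use_for_fit):
    """Per-batch linear drift correction for one metabolite (illustrative).

    intensity   : log-scale peak intensities; NaN marks a non-detect
    batch       : batch label of every injection
    order       : injection order of every injection
    use_for_fit : boolean mask of the injections used to fit the trend
                  (for example the QCs)
    """
    intensity = np.asarray(intensity, dtype=float)
    batch = np.asarray(batch)
    order = np.asarray(order, dtype=float)
    use_for_fit = np.asarray(use_for_fit, dtype=bool)

    # Assumption: non-detects (NaN) are excluded from the fit, not imputed.
    fit_ok = use_for_fit & ~np.isnan(intensity)
    if not fit_ok.any():
        raise ValueError("no detected samples available for fitting")

    # Common reference level to which every batch is shifted.
    reference = intensity[fit_ok].mean()

    corrected = intensity.copy()
    for b in np.unique(batch):
        in_batch = batch == b
        fit = in_batch & fit_ok
        if fit.sum() < 2:
            continue  # too few points for a line; batch stays uncorrected
        slope, intercept = np.polyfit(order[fit], intensity[fit], deg=1)
        predicted = slope * order[in_batch] + intercept
        # Remove the fitted batch trend, then restore the reference level.
        corrected[in_batch] = intensity[in_batch] - predicted + reference
    return corrected


# Toy data: two batches of six injections with different offsets and drifts.
order = np.arange(12)
batch = np.array(["A"] * 6 + ["B"] * 6)
is_qc = np.array([True, False, False, True, False, True] * 2)
true_level = np.where(is_qc, 10.0, 11.0)
drift = np.where(batch == "A", 0.1 * order, 2.0 - 0.05 * order)
observed = true_level + drift
observed[4] = np.nan  # one non-detect among the study samples

corrected = correct_batches(observed, batch, order, is_qc)
print("QC spread before:", np.ptp(observed[is_qc]))
print("QC spread after: ", np.ptp(corrected[is_qc]))
```

Passing `~is_qc` as the mask fits the trend on the study samples instead, which, as the abstract notes, is only sensible when the injection order is properly randomized. Non-detects stay NaN in the output, so a separate strategy for handling them can be applied afterwards.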
