Non-targeted UHPLC-MS metabolomic data processing methods: a comparative investigation of normalisation, missing value imputation, transformation and scaling

Introduction: The generic metabolomics data processing workflow consists of a series of processes including peak picking, quality assurance, normalisation, missing value imputation, transformation and scaling. The combination of these processes should present the experimental data in an appropriate structure so that biological changes can be identified in a valid and robust manner.

Objectives: Currently, different researchers apply different data processing methods, and no assessment of the permutations applied to UHPLC-MS datasets has been published. Here we aim to define the most appropriate data processing workflow.

Methods: We assess the influence of normalisation, missing value imputation, transformation and scaling methods on univariate and multivariate analysis of UHPLC-MS datasets acquired for different mammalian samples.

Results: Our studies show that, once data are filtered, missing values are not correlated with m/z, retention time or response. Following an exhaustive evaluation, we recommend probabilistic quotient normalisation (PQN) with no missing value imputation and no transformation or scaling for univariate analysis. For PCA we recommend PQN with Random Forest missing value imputation, generalised logarithm (glog) transformation and no scaling. For PLS-DA we recommend PQN, k-nearest neighbours (KNN) missing value imputation, glog transformation and no scaling. These recommendations are based on searching for the biologically important metabolite features independent of their measured abundance.

Conclusion: The appropriate choice of normalisation, missing value imputation, transformation and scaling methods differs depending on the data analysis method, and this choice is essential to maximise the biological information derived from UHPLC-MS datasets.
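As a rough illustration of two of the steps named above, the following Python sketch shows one common formulation of PQN and of the glog transformation. It is not the authors' implementation. The function names, the use of the feature-wise median as the reference profile, the omission of any prior total-signal normalisation, and the fixed default for `lam` are all assumptions made for this example; in practice the glog parameter is usually optimised on the data, for instance on quality control samples. Intensities are assumed to be positive, with missing values encoded as NaN.

```python
import numpy as np


def pqn_normalise(X, reference=None):
    """Probabilistic quotient normalisation (illustrative sketch).

    X is a samples x features matrix of positive intensities; NaN marks
    a missing value. If no reference profile is given, the feature-wise
    median across samples is used (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    if reference is None:
        reference = np.nanmedian(X, axis=0)
    # Quotient of every feature against the reference profile.
    quotients = X / reference
    # The median quotient per sample estimates its dilution factor.
    dilution = np.nanmedian(quotients, axis=1)
    # Divide each sample (row) by its estimated dilution factor.
    return X / dilution[:, None]


def glog(X, lam=1e-8):
    """Generalised logarithm, log(x + sqrt(x^2 + lam)).

    lam is a hypothetical fixed default here; it behaves like log(2x)
    for large x and stays finite at x = 0 when lam > 0.
    """
    X = np.asarray(X, dtype=float)
    return np.log(X + np.sqrt(X ** 2 + lam))
```

Under these assumptions, the multivariate recommendation would correspond to calling `pqn_normalise`, then imputing missing values with a separate method, then applying `glog`, with no scaling afterwards.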
