Reflections on univariate and multivariate analysis of metabolomics data

Abstract

Metabolomics experiments usually produce a large quantity of data. Univariate and multivariate analysis techniques are routinely used to extract relevant information from these data, with the aim of providing biological knowledge about the problem studied. Although statistical tools such as the t test, analysis of variance, principal component analysis, and partial least squares discriminant analysis form the backbone of the statistical part of the vast majority of metabolomics papers, many basic but fundamental questions are still often asked: Why do the results of univariate and multivariate analyses differ? Why apply univariate methods if a multivariate method has already been applied? Why can I see something multivariately when I see nothing univariately? In the present paper we address some aspects of univariate and multivariate analysis, with the aim of clarifying in simple terms the main differences between the two approaches. Applications of the t test, analysis of variance, principal component analysis, and partial least squares discriminant analysis are shown on both real and simulated metabolomics data examples to provide an overview of fundamental aspects of univariate and multivariate methods.
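The question of why an effect can be visible multivariately but not univariately can be illustrated with a small simulation. The sketch below is not taken from the paper: the sample size, correlation, and mean shift are arbitrary assumptions, and Fisher's linear discriminant is used as a simple stand-in for a multivariate method such as partial least squares discriminant analysis. Two strongly correlated "metabolites" are shifted in opposite directions between groups, so each variable alone shows only a weak difference while their combination separates the groups clearly.

```python
import numpy as np


def welch_t(a, b):
    """Welch's t statistic for two 1-D samples (statistic only, no p-value)."""
    va, vb = a.var(ddof=1), b.var(ddof=1)
    return (b.mean() - a.mean()) / np.sqrt(va / a.size + vb / b.size)


def simulate(n=50, rho=0.98, shift=0.2, seed=0):
    """Two groups of n samples with two correlated variables.

    All parameter values are illustrative assumptions, not values from the paper.
    Group B is shifted by (+shift, -shift), i.e. against the correlation.
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    group_a = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    group_b = rng.multivariate_normal([shift, -shift], cov, size=n)
    return group_a, group_b


def fisher_direction(a, b):
    """Fisher discriminant direction: pooled covariance^-1 times mean difference."""
    na, nb = a.shape[0], b.shape[0]
    pooled = ((na - 1) * np.cov(a, rowvar=False)
              + (nb - 1) * np.cov(b, rowvar=False)) / (na + nb - 2)
    return np.linalg.solve(pooled, b.mean(axis=0) - a.mean(axis=0))


group_a, group_b = simulate()

# Univariate view: one t statistic per variable.
t_uni = [welch_t(group_a[:, j], group_b[:, j]) for j in range(2)]

# Multivariate view: t statistic of the scores along the discriminant direction.
# The direction is fitted on the same data, so this value is optimistic.
w = fisher_direction(group_a, group_b)
t_multi = welch_t(group_a @ w, group_b @ w)

print("univariate t statistics:", np.round(t_uni, 2))
print("t statistic along discriminant direction:", round(float(t_multi), 2))
```

Under these assumptions the per-variable standardized shift is only 0.2, whereas the shift measured against the within-group covariance (the Mahalanobis distance) is about 2, which is why the combined score discriminates far better than either variable alone. In a real analysis the multivariate statistic would need proper validation, for example by cross-validation or permutation testing.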
