Experiment design beyond gut feeling: statistical tests and power to detect differential metabolites in mass spectrometry data

Univariate hypothesis tests such as Student’s t-test or analysis of variance (ANOVA) can help to answer a variety of questions in metabolomics data analysis. The statistical power of these tests depends on the experimental design and on the analytical variance of the actual observations. In this paper, we demonstrate how a well-designed pilot study, carried out before an experiment that aims to find differences between, e.g., several genotypes, can help to determine the variance at multiple levels, ranging from biological variance through sample preparation to instrumental variance. Next, we illustrate how these variances can be used to obtain several parameters (e.g. the minimum effect that can be detected as statistically significant, the number of required replicates and the error probabilities) that influence the design of the actual study. In particular, we sketch how technical replicates can improve the performance of a test when they are used correctly in the statistical analysis, e.g. with a hierarchical model. Finally, we demonstrate how to evaluate the trade-off between experimental designs with different replication strategies. Beyond gut feeling, the choice of an experimental design can be guided by factors such as cost, sample availability and the accuracy of the tests. We use metabolite profiles of the model plant Arabidopsis thaliana measured on a UPLC-ESI/QqTOF-MS as a real-world dataset, but the approach is equally applicable to other sample types and measurement methods such as NMR-based metabolomics.
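To illustrate how variance estimates from a pilot study might feed into design decisions, the following Python sketch compares replication strategies for a two-group comparison. It is not the method used in the paper: it assumes a simple two-level nested design (biological replicates, each measured with several technical replicates), equal group sizes, and a normal approximation to the two-sided test. All function names, parameter names and numeric values are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def effective_variance(var_bio, var_tech, n_tech):
    """Variance of one biological replicate's mean over n_tech technical replicates.

    Assumption: technical errors are independent, so averaging divides
    the technical variance by n_tech; biological variance is unaffected.
    """
    return var_bio + var_tech / n_tech


def power_two_groups(delta, var_bio, var_tech, n_bio, n_tech, alpha=0.05):
    """Approximate power to detect a mean difference `delta` between two groups.

    Uses a normal approximation (not the exact noncentral t distribution),
    with n_bio biological replicates per group.
    """
    se = sqrt(2.0 * effective_variance(var_bio, var_tech, n_tech) / n_bio)
    z_crit = _Z.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(delta) / se
    # Probability that the test statistic falls in either rejection region.
    return _Z.cdf(shift - z_crit) + _Z.cdf(-shift - z_crit)


def required_bio_replicates(delta, var_bio, var_tech, n_tech,
                            alpha=0.05, power=0.8):
    """Smallest n_bio per group reaching the target power (normal approximation)."""
    z_sum = _Z.inv_cdf(1.0 - alpha / 2.0) + _Z.inv_cdf(power)
    var_eff = effective_variance(var_bio, var_tech, n_tech)
    return ceil(2.0 * var_eff * z_sum ** 2 / delta ** 2)


if __name__ == "__main__":
    # Hypothetical pilot-study estimates on a log-intensity scale.
    var_bio, var_tech, delta = 0.04, 0.09, 0.3
    for n_tech in (1, 2, 3):
        n_bio = required_bio_replicates(delta, var_bio, var_tech, n_tech)
        pw = power_two_groups(delta, var_bio, var_tech, n_bio, n_tech)
        print(f"technical replicates={n_tech}: "
              f"biological replicates per group={n_bio}, power={pw:.2f}")
```

Under these assumptions, adding technical replicates reduces only the technical share of the variance, so the benefit levels off once biological variance dominates. A hierarchical (mixed-effects) model, as mentioned in the abstract, would estimate both variance components directly instead of relying on this approximation.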
