Harnessing the complexity of metabolomic data with chemometrics

Because modern analytical chemistry platforms can measure an ever-increasing number of signals within a single run, life sciences datasets are becoming not only larger but also more intricate in structure. Making use of this wealth of data raises several challenges: extracting relevant elements from massive amounts of signals, possibly spread across different tables; reducing dimensionality; summarising dynamic information in a comprehensible way; and displaying it for interpretation. Metabolomics is a representative example of a fast-moving research field that takes advantage of recent technological advances to provide extensive sample monitoring. Because of the wide chemical diversity of metabolites, several analytical setups are required to provide broad coverage of complex samples. The integration and visualisation of multiple highly multivariate datasets are therefore key issues for effective analysis leading to valuable biological or chemical knowledge. Additionally, high-order data structures arise from experimental setups involving time-resolved measurements. These data are intrinsically multiway, and classical statistical tools cannot be applied without altering their organisation, at the risk of losing information. Dedicated modelling algorithms, able to cope with the inherent properties of these metabolomic datasets, are therefore mandatory for harnessing their complexity and providing relevant information. In that perspective, chemometrics has a central role to play. Copyright © 2013 John Wiley & Sons, Ltd.
