Probabilistic principal component analysis for metabolomic data

Background: Data from metabolomic studies are typically complex and high-dimensional. Principal component analysis (PCA) is currently the most widely used statistical technique for analyzing metabolomic data. However, PCA is limited by the fact that it is not based on a statistical model.

Results: Here, probabilistic principal component analysis (PPCA), which addresses some of the limitations of PCA, is reviewed and extended. A novel extension of PPCA, called probabilistic principal component and covariates analysis (PPCCA), is introduced; it provides a flexible approach to jointly modeling metabolomic data and additional covariate information. The use of a mixture of PPCA models for discovering the number of inherent groups in metabolomic data is demonstrated. The jackknife technique is employed throughout to construct confidence intervals for estimated model parameters. The optimal number of principal components is determined using the Bayesian Information Criterion (BIC) model selection tool, which is modified to address the high dimensionality of the data.

Conclusions: The methods presented are illustrated through an application to metabolomic data sets. Jointly modeling metabolomic data and covariates was successfully achieved and has the potential to provide deeper insight into the underlying data structure. Examination of confidence intervals for the model parameters, such as loadings, allows for principled and clear interpretation of the underlying data structure. A software package called MetabolAnalyze, freely available through the R statistical software, has been developed to facilitate implementation of the presented methods in the metabolomics field.
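To make the model-based view of PCA concrete, the following is a minimal Python sketch of fitting PPCA and choosing the number of components by BIC. It is not the MetabolAnalyze implementation. It assumes the standard closed-form maximum likelihood solution for PPCA (noise variance equal to the mean of the discarded covariance eigenvalues) and the unmodified BIC, not the high-dimensional modification described above. The function names `ppca_fit` and `ppca_bic` and the simulated data are illustrative assumptions.

```python
import numpy as np


def ppca_fit(X, q):
    """Closed-form maximum likelihood PPCA fit with q components (requires q < p)."""
    n, p = X.shape
    mu = X.mean(axis=0)
    # Maximum likelihood covariance estimate (divides by n, not n - 1).
    S = np.cov(X, rowvar=False, bias=True)
    eigvals, eigvecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    eigvals = np.clip(eigvals[::-1], 0.0, None)  # descending; guard tiny negatives
    eigvecs = eigvecs[:, ::-1]
    # Noise variance: average of the p - q discarded eigenvalues.
    sigma2 = eigvals[q:].mean()
    # Loadings: leading eigenvectors scaled by sqrt(eigenvalue - sigma2).
    W = eigvecs[:, :q] * np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    # Log-likelihood evaluated at the maximum likelihood estimates.
    loglik = -0.5 * n * (
        p * np.log(2.0 * np.pi)
        + np.log(eigvals[:q]).sum()
        + (p - q) * np.log(sigma2)
        + p
    )
    return mu, W, sigma2, loglik


def ppca_bic(X, q):
    """Standard BIC (lower is better) for a q-component PPCA model."""
    n, p = X.shape
    loglik = ppca_fit(X, q)[3]
    # Free parameters: loadings up to rotation, noise variance, and mean.
    k = p * q - q * (q - 1) // 2 + 1 + p
    return -2.0 * loglik + k * np.log(n)


# Simulated data (an assumption for illustration): 200 samples, 10 variables,
# two latent components, isotropic noise with standard deviation 0.1.
rng = np.random.default_rng(0)
n, p, q_true = 200, 10, 2
W_true = rng.normal(size=(p, q_true))
scores = rng.normal(size=(n, q_true))
X = scores @ W_true.T + 0.1 * rng.normal(size=(n, p))

bic = {q: ppca_bic(X, q) for q in range(1, 6)}
best_q = min(bic, key=bic.get)
print("BIC by number of components:", bic)
print("Selected number of components:", best_q)
```

The sketch uses more samples than variables so that the sample covariance has full rank. Metabolomic data typically have far more variables than samples, which is the setting the modified BIC in the paper is meant to handle.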
