Integrative Analysis of Metabolomics and Transcriptomics Data: A Unified Model Framework to Identify Underlying System Pathways

The abundance of high-dimensional measurements in the form of gene expression and mass spectrometry data calls for models that elucidate the underlying biological system. For widely studied organisms such as yeast, it is possible to incorporate prior knowledge from a variety of databases, an approach used in several recent studies. However, if such information is not available for a particular organism, these methods fall short. In this paper we propose a statistical method that identifies genes controlling the production of metabolites and is applicable to datasets consisting of liquid chromatography-mass spectrometry (LC-MS) and gene expression (DNA microarray) measurements from the same samples. Due to the high dimensionality of both LC-MS and DNA microarray data, dimension reduction and variable selection are key elements of the analysis. Our proposed approach starts by identifying the basis functions (“building blocks”) that constitute the output from a mass spectrometry experiment. Subsequently, the weights of these basis functions are related to the observations from the corresponding gene expression data in order to identify which genes are associated with specific patterns seen in the metabolite data. The modeling framework is flexible and computationally fast, and it can accommodate treatment effects and other variables related to the experimental design. We demonstrate that, within the proposed framework, genes regulating the production of specific metabolites are identified correctly as long as the variation in the noise is no more than twice that of the signal.
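The two-step idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes non-negative matrix factorization (with multiplicative updates) as the decomposition that yields the basis functions, and it uses a simple correlation ranking as a stand-in for the variable selection step. The function names `nmf` and `top_genes`, the synthetic data, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def nmf(X, rank, n_iter=1000, seed=0, eps=1e-9):
    """Approximate nonnegative X (samples x features) as W @ H
    using multiplicative updates for the squared error."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.random((n, rank)) + eps   # per-sample weights
    H = rng.random((rank, p)) + eps   # basis functions ("building blocks")
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def top_genes(W, G, k=1):
    """For each basis function, rank genes by absolute Pearson
    correlation between gene expression and the basis weights."""
    Wc = (W - W.mean(axis=0)) / (W.std(axis=0) + 1e-12)
    Gc = (G - G.mean(axis=0)) / (G.std(axis=0) + 1e-12)
    corr = Wc.T @ Gc / W.shape[0]     # shape: (rank, n_genes)
    order = np.argsort(-np.abs(corr), axis=1)
    return order[:, :k], corr

# Synthetic example (assumed data, not from the paper): two spectral
# patterns whose abundances are driven by genes 3 and 7 respectively.
rng = np.random.default_rng(1)
n_samples, n_genes, n_mz = 60, 20, 60
G = rng.normal(size=(n_samples, n_genes))        # gene expression
true_H = np.zeros((2, n_mz))
true_H[0, 5:15] = 1.0                            # pattern 1
true_H[1, 35:50] = 1.0                           # pattern 2
true_W = np.maximum(G[:, [3, 7]], 0.0)           # nonnegative weights
X = true_W @ true_H + 0.02 * rng.random((n_samples, n_mz))  # "LC-MS" data

W, H = nmf(X, rank=2)
top, corr = top_genes(W, G, k=1)
print("Top gene per basis function:", top[:, 0])
```

In this sketch the decomposition sees only the metabolite-like matrix `X`; the gene expression matrix `G` enters only in the second step, mirroring the order described in the abstract. A real analysis would replace the correlation ranking with a proper variable selection method and would need to choose the number of basis functions from the data.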
