Multivariate multi-way analysis of multi-source data

Motivation: Analysis of variance (ANOVA)-type methods are the default tool for the analysis of data with multiple covariates. These tools have been generalized to the multivariate analysis of high-throughput biological datasets, where the main challenge is the combination of small sample size and high dimensionality. However, existing multi-way analysis methods are not designed for the increasingly important experiments in which data are obtained from multiple sources. Common examples of such settings include integrated analysis of metabolic and gene expression profiles or, as in our case, metabolic profiles from several tissues, in a controlled multi-way experimental setup where disease status, medical treatment, gender and time are typical covariates.

Results: We extend the applicability of multivariate, multi-way ANOVA-type methods to multi-source cases by introducing a novel Bayesian model. The method is capable of finding covariate-related dependencies between the sources. It assumes that the measurements consist of groups of similarly behaving variables, and estimates the multivariate covariate effects and their interaction effects for the discovered groups of variables. In particular, the method partitions the effects into those shared between the sources and those that are source-specific. The method is specifically designed for datasets with small sample sizes and high dimensionality. We apply the method to a lipidomics dataset from a lung cancer study with a two-way experimental setup, in which measurements have been taken from several tissues with mostly distinct lipids. The method is also directly applicable to gene expression and proteomics data.

Availability: An R implementation is available at http://www.cis.hut.fi/projects/mi/software/multiWayCCA/

Contact: ilkka.huopaniemi@tkk.fi; samuel.kaski@tkk.fi
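To make the idea of covariate effects and their interaction concrete, the following is a minimal sketch of a classical two-way cell-means decomposition, applied separately to one summary value per sample from each of two sources. It is not the Bayesian model described above and not the R implementation; the function name, the toy data and the reading of the effects as "shared" or "source-specific" are assumptions made for illustration only. It assumes a balanced design in which every combination of the two covariates is observed.

```python
from statistics import mean


def two_way_effects(values, a_levels, b_levels):
    """Classical two-way decomposition of one summary variable.

    Returns (mu, alpha, beta, interaction) such that
    cell_mean[(a, b)] = mu + alpha[a] + beta[b] + interaction[(a, b)].
    Assumes every (a, b) combination occurs at least once.
    """
    # Collect the observations that fall into each cell of the design.
    cells = {}
    for v, a, b in zip(values, a_levels, b_levels):
        cells.setdefault((a, b), []).append(v)
    cell_mean = {key: mean(obs) for key, obs in cells.items()}

    a_set = sorted({a for a, _ in cell_mean})
    b_set = sorted({b for _, b in cell_mean})

    # Grand mean over cells, then main effects and the interaction.
    mu = mean(cell_mean[(a, b)] for a in a_set for b in b_set)
    alpha = {a: mean(cell_mean[(a, b)] for b in b_set) - mu for a in a_set}
    beta = {b: mean(cell_mean[(a, b)] for a in a_set) - mu for b in b_set}
    interaction = {
        (a, b): cell_mean[(a, b)] - mu - alpha[a] - beta[b]
        for a in a_set
        for b in b_set
    }
    return mu, alpha, beta, interaction


# Hypothetical toy design: covariate a = disease status, b = treatment.
disease = [0, 0, 0, 0, 1, 1, 1, 1]
treatment = [0, 0, 1, 1, 0, 0, 1, 1]

# One summary value per sample for a variable group in each source
# (made-up numbers, e.g. two tissues).
source_x = [0.5, 1.5, 0.5, 1.5, 2.5, 3.5, 2.5, 3.5]  # disease effect only
source_y = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0]  # disease and treatment

mu_x, alpha_x, beta_x, inter_x = two_way_effects(source_x, disease, treatment)
mu_y, alpha_y, beta_y, inter_y = two_way_effects(source_y, disease, treatment)
```

In this toy data the disease effect appears in both sources, while the treatment effect appears only in the second source. The model in the abstract makes that distinction between shared and source-specific effects within a single Bayesian formulation, rather than by comparing separate per-source analyses as done here.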
