Real-life metabolomics data analysis: how to deal with complex data?

MOTIVATION: Modern functional genomics generates high-dimensional datasets. It is often convenient to have a single number that summarizes the relationship between a pair of such datasets. Matrix correlations are such numbers, and they are appealing because they can be interpreted in the same way as the Pearson correlations familiar to biologists. The high dimensionality of functional genomics data is, however, problematic for existing matrix correlations. The motivation of this article is twofold: (i) we introduce the idea of matrix correlations to the bioinformatics community and (ii) we give an improvement of the most promising matrix correlation coefficient (the RV-coefficient) that circumvents the problems of high-dimensional data.

RESULTS: The modified RV-coefficient can be used in high-dimensional data analysis studies as an easy measure of the information shared by two datasets. This is shown by theoretical arguments, simulations and applications to two real-life examples from functional genomics: a transcriptomics example and a metabolomics example.

AVAILABILITY: The Matlab m-files of the methods presented can be downloaded from http://www.bdagroup.nl.
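
To make the idea of a matrix correlation concrete, the following is a minimal NumPy sketch; it is not the authors' Matlab code. It assumes the standard RV-coefficient definition, RV(X, Y) = tr(XX'YY') / sqrt(tr((XX')^2) tr((YY')^2)), for two column-centered matrices X and Y measured on the same samples (rows). The abstract does not spell out the modification, so the `modified=True` branch rests on an assumption: that the diagonals of XX' and YY' are removed before the coefficient is computed. The function name `rv_coefficient` and its parameters are illustrative only.

```python
import numpy as np


def rv_coefficient(X, Y, modified=False):
    """Matrix correlation between two datasets that share their rows (samples).

    modified=False: standard RV-coefficient.
    modified=True:  ASSUMED form of the modification, in which the diagonals
                    of XX' and YY' are set to zero before correlating.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.ndim != 2 or Y.ndim != 2 or X.shape[0] != Y.shape[0]:
        raise ValueError("X and Y must be 2-D with the same number of rows")

    # Column-center both blocks, as is usual before computing RV.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)

    # Sample-by-sample cross-product (configuration) matrices, both n x n.
    Sx = X @ X.T
    Sy = Y @ Y.T

    if modified:
        # Assumption: drop the diagonal, keep only between-sample terms.
        Sx = Sx - np.diag(np.diag(Sx))
        Sy = Sy - np.diag(np.diag(Sy))

    # For symmetric matrices, tr(Sx @ Sy) equals the sum of elementwise products.
    numerator = np.sum(Sx * Sy)
    denominator = np.sqrt(np.sum(Sx * Sx) * np.sum(Sy * Sy))
    return numerator / denominator


if __name__ == "__main__":
    # Two unrelated random datasets: 50 samples, 500 variables each.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 500))
    Y = rng.standard_normal((50, 500))
    print("standard RV:", rv_coefficient(X, Y))
    print("modified RV:", rv_coefficient(X, Y, modified=True))
```

With many more variables than samples, the diagonal of each cross-product matrix dominates its off-diagonal entries, so the standard coefficient tends to be high even for unrelated random data such as the two matrices above. Under the stated assumption, removing the diagonals lowers the value considerably for such data, while a dataset compared with itself still scores 1.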
