Structure-revealing data fusion model with applications in metabolomics

In many disciplines, data from multiple sources are acquired and jointly analyzed for enhanced knowledge discovery. For instance, in metabolomics, different analytical techniques are used to measure biological fluids in order to identify the chemicals related to certain diseases. It is widely known that some of these analytical methods, e.g., LC-MS (Liquid Chromatography-Mass Spectrometry) and NMR (Nuclear Magnetic Resonance) spectroscopy, provide complementary data sets, and that their joint analysis may capture a larger proportion of the complete metabolome of a specific biological system. Fusing data from multiple sources has proved useful in many fields, including bioinformatics, signal processing, and social network analysis. However, identifying common (shared) and individual (unshared) structures across multiple data sets remains a major challenge in data fusion studies. With the goal of addressing this challenge, we propose a novel unsupervised data fusion model. Our contributions are twofold: (i) We formulate a data fusion model based on joint factorization of matrices and higher-order tensors, which can automatically reveal common and individual components. (ii) We demonstrate that the proposed approach provides promising results in the joint analysis of metabolomics data sets consisting of fluorescence and NMR measurements of plasma samples, in terms of separating colorectal cancer patients from controls.
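
To make "joint factorization of matrices and higher-order tensors" concrete, the following is a minimal sketch of a generic coupled matrix and tensor factorization, not the structure-revealing model proposed here. It assumes a third-order tensor X (samples x J x K, e.g. fluorescence) and a matrix Y (samples x M, e.g. NMR) that share the sample-mode factor A, an unweighted sum of the two squared errors as the objective, and plain alternating least squares as the solver. All function and variable names are hypothetical, and the mechanism that separates shared from unshared components is not shown.

```python
import numpy as np


def khatri_rao(P, Q):
    """Column-wise Kronecker product of P (p x R) and Q (q x R) -> (p*q x R)."""
    R = P.shape[1]
    return (P[:, None, :] * Q[None, :, :]).reshape(-1, R)


def unfold(T, mode):
    """Matricize T along `mode`, keeping the other modes in their original order."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)


def coupled_loss(X, Y, A, B, C, V):
    """Sum of squared errors of the tensor model and the matrix model (assumed equal weights)."""
    X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
    return np.sum((X - X_hat) ** 2) + np.sum((Y - A @ V.T) ** 2)


def cmtf_als(X, Y, rank, n_iter=100, seed=0):
    """Alternating least squares for X ~ [[A, B, C]] and Y ~ A V^T with shared A."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    M = Y.shape[1]
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    V = rng.standard_normal((M, rank))
    X0, X1, X2 = unfold(X, 0), unfold(X, 1), unfold(X, 2)
    for _ in range(n_iter):
        # Shared factor: fit A to the tensor unfolding and the matrix at once.
        Z = np.vstack([khatri_rao(B, C), V])
        A = np.linalg.lstsq(Z, np.hstack([X0, Y]).T, rcond=None)[0].T
        # Tensor-only factors.
        B = np.linalg.lstsq(khatri_rao(A, C), X1.T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B), X2.T, rcond=None)[0].T
        # Matrix-only factor.
        V = np.linalg.lstsq(A, Y, rcond=None)[0].T
    return A, B, C, V
```

Because A is estimated from both data sets simultaneously, every component in this sketch is forced to be shared; a model that also allows individual components needs an additional device that lets a component switch off in one of the data sets.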
