Common and distinct components in data fusion

In many areas of science, multiple sets of data are collected pertaining to the same system. Examples are food products that are characterized by different sets of variables, bioprocesses that are sampled online with different instruments, or biological systems for which different genomic measurements are obtained. Data fusion is concerned with analyzing such sets of data simultaneously to arrive at a global view of the system under study. One of the emerging areas of data fusion is exploring whether the data sets have something in common or not. This gives insight into common and distinct variation in each data set, thereby facilitating understanding of the relationships between the data sets. Unfortunately, research on methods to distinguish common and distinct components is fragmented, both in terminology and in methods: there is no common ground, which hampers comparing methods and understanding their relative merits. This paper provides a unifying framework for this subfield of data fusion by using rigorous arguments from linear algebra. The most frequently used methods for distinguishing common and distinct components are explained in this framework, and some practical examples of these methods are given in the areas of medical biology and food science.
