D-GCCA: Decomposition-based Generalized Canonical Correlation Analysis for Multiple High-dimensional Datasets

Modern biomedical studies often collect multiple types of high-dimensional data on a common set of objects. A popular model for the joint analysis of such multi-type datasets decomposes each data matrix into three parts: a low-rank common-variation matrix generated by latent factors shared across all datasets, a low-rank distinctive-variation matrix specific to that dataset, and an additive noise matrix. We propose decomposition-based generalized canonical correlation analysis (D-GCCA), a novel decomposition method that defines these matrices on the L2 space of random variables, whereas most existing methods are developed on its approximation, the Euclidean dot product space. Moreover, to calibrate the common latent factors well, we impose an orthogonality constraint on the distinctive latent factors. Existing methods do not adequately account for such orthogonality and can therefore leave a substantial amount of common variation undetected. D-GCCA goes one step further than GCCA by separating common and distinctive variations among canonical variables, and it admits an appealing interpretation from the perspective of principal component analysis. We establish consistent estimators of the common-variation and distinctive-variation matrices that show good finite-sample numerical performance and have closed-form expressions, which leads to efficient computation, especially for large-scale datasets. The superiority of D-GCCA over state-of-the-art methods is also corroborated in simulations and real-world data examples.
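
The three-part decomposition described above can be written schematically as follows. This is only a sketch of the general model form stated in the abstract; the symbols (Y_k for the k-th observed data matrix, C_k, D_k, E_k for its common-variation, distinctive-variation, and noise parts, K for the number of datasets, p_k for the number of variables in dataset k, and n for the number of objects) are notation assumed here for illustration and are not taken from the abstract.

```latex
% Assumed notation: K datasets observed on the same n objects.
% Y_k : p_k x n observed data matrix of the k-th dataset
% C_k : low-rank common-variation matrix (driven by latent factors shared by all K datasets)
% D_k : low-rank distinctive-variation matrix (specific to the k-th dataset)
% E_k : additive noise matrix
\[
  Y_k \;=\; C_k + D_k + E_k, \qquad k = 1, \dots, K,
\]
\[
  \operatorname{rank}(C_k) \ll \min(p_k, n), \qquad
  \operatorname{rank}(D_k) \ll \min(p_k, n).
\]
```

The orthogonality constraint on the distinctive latent factors is not formalized in this sketch, since the abstract does not give its exact form.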
