Sparse CCA via Precision Adjusted Iterative Thresholding

Sparse canonical correlation analysis (CCA) has received considerable attention in high-dimensional data analysis as a tool for studying the relationship between two sets of random variables. Despite active methodological and applied research, however, sparse CCA in high-dimensional settings has had remarkably little theoretical statistical foundation. In this paper, we introduce an elementary necessary and sufficient characterization of when the solution of CCA is indeed sparse, propose a computationally efficient procedure, called CAPIT, to estimate the canonical directions, and show that the procedure is rate-optimal under various assumptions on nuisance parameters. We apply the procedure to a breast cancer dataset from The Cancer Genome Atlas project and identify methylation probes associated with genes that have previously been characterized as prognosis signatures of breast cancer metastasis.
