A greedy approach to sparse canonical correlation analysis

We consider the problem of sparse canonical correlation analysis (CCA), i.e., the search for two linear combinations, one for each multivariate, that yield maximum correlation using a specified number of variables. We propose an efficient numerical approximation based on a direct greedy approach that bounds the correlation at each stage. The method is specifically designed to cope with large data sets, and its computational complexity depends only on the sparsity levels. We analyze the algorithm's performance through the tradeoff between correlation and parsimony. Numerical simulation results suggest that a significant portion of the correlation may be captured using a relatively small number of variables. In addition, we examine the use of sparse CCA as a regularization method when the number of available samples is small compared to the dimensions of the multivariates.
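To make the idea of greedy variable selection for sparse CCA concrete, here is a minimal sketch in Python with NumPy. It is not the algorithm proposed in the paper: it re-evaluates the canonical correlation by brute force for every candidate variable rather than using the per-stage bounds mentioned in the abstract, so its cost does grow with the total number of variables. The function names `canonical_correlation` and `greedy_sparse_cca`, the sparsity parameters `kx` and `ky`, and the small `ridge` term are all assumptions introduced for illustration.

```python
import numpy as np


def canonical_correlation(X, Y, ridge=1e-8):
    """Largest canonical correlation between the column sets X and Y.

    `ridge` is a small diagonal loading (an assumption of this sketch)
    that keeps the Cholesky factorizations numerically stable.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / n + ridge * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + ridge * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    # Whiten both sides: W = Lx^{-1} Sxy Ly^{-T}, where S = L L^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    W = np.linalg.solve(Lx, Sxy)
    W = np.linalg.solve(Ly, W.T).T
    # The singular values of W are the canonical correlations.
    return float(np.linalg.svd(W, compute_uv=False)[0])


def greedy_sparse_cca(X, Y, kx, ky):
    """Forward selection of up to kx columns of X and ky columns of Y.

    Hypothetical brute-force variant: every candidate is scored by
    recomputing the correlation on the enlarged subset.
    Assumes kx >= 1 and ky >= 1.
    """
    p, q = X.shape[1], Y.shape[1]
    # Seed with the single most correlated (x_i, y_j) pair.
    rho, i0, j0 = max(
        (canonical_correlation(X[:, [i]], Y[:, [j]]), i, j)
        for i in range(p)
        for j in range(q)
    )
    sel_x, sel_y, path = [i0], [j0], [rho]
    while len(sel_x) < kx or len(sel_y) < ky:
        candidates = []
        if len(sel_x) < kx:
            for i in sorted(set(range(p)) - set(sel_x)):
                r = canonical_correlation(X[:, sel_x + [i]], Y[:, sel_y])
                candidates.append((r, "x", i))
        if len(sel_y) < ky:
            for j in sorted(set(range(q)) - set(sel_y)):
                r = canonical_correlation(X[:, sel_x], Y[:, sel_y + [j]])
                candidates.append((r, "y", j))
        if not candidates:  # requested more variables than exist
            break
        # Add the single variable that raises the correlation the most.
        rho, side, idx = max(candidates)
        (sel_x if side == "x" else sel_y).append(idx)
        path.append(rho)
    return sel_x, sel_y, path
```

Calling `greedy_sparse_cca(X, Y, kx, ky)` on two sample matrices with matching row counts returns the selected column indices for each side, in the order they were added, along with the correlation reached after each step. Plotting that correlation path against the number of selected variables gives one view of the correlation versus parsimony tradeoff the abstract refers to.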
