Canonical Correlation Analysis for Multilabel Classification: A Least-Squares Formulation, Extensions, and Analysis

Canonical Correlation Analysis (CCA) is a well-known technique for finding correlations between two sets of multidimensional variables. It projects both sets onto a lower-dimensional space in which they are maximally correlated. CCA is commonly applied for supervised dimensionality reduction, where the two sets of variables are derived from the data and the class labels, respectively. It is known that CCA can be formulated as a least-squares problem in the binary-class case; however, the extension to the more general setting has remained unclear. In this paper, we show that under a mild condition, which tends to hold for high-dimensional data, CCA in the multilabel case can also be formulated as a least-squares problem. Based on this equivalence, efficient algorithms for solving least-squares problems can be applied to scale CCA to very large data sets. In addition, we propose several CCA extensions, including a sparse CCA formulation based on 1-norm regularization, and we further extend the least-squares formulation to partial least squares. We also show that the CCA projection for one set of variables is independent of the regularization applied to the other set, providing new insight into the effect of regularization on CCA. Experiments on benchmark multilabel data sets confirm the established equivalence relationship and demonstrate the effectiveness and efficiency of the proposed CCA extensions.
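
For concreteness, here is a sketch in standard notation (ours; the paper's own symbols may differ). Given a data matrix X and a label matrix Y with one column per sample, CCA seeks projection directions w_x and w_y maximizing the correlation of the projected variables; the least-squares reformulation replaces this with a regression onto a label-derived target T, one common construction being T = (Y Y^T)^{-1/2} Y:

    % CCA: maximally correlated one-dimensional projections of X and Y
    \max_{w_x,\,w_y}\; w_x^{\top} X Y^{\top} w_y
    \quad\text{s.t.}\quad
    w_x^{\top} X X^{\top} w_x = 1, \qquad
    w_y^{\top} Y Y^{\top} w_y = 1

    % Least-squares reformulation (under the rank condition on X),
    % with a label-derived target T, e.g. T = (Y Y^{\top})^{-1/2} Y:
    \min_{W}\; \bigl\| X^{\top} W - T^{\top} \bigr\|_{F}^{2}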

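As an illustration only (none of the names, constants, or the target construction below are taken from the paper), a minimal NumPy/SciPy sketch that checks the equivalence numerically: it computes the X-side CCA directions from the generalized eigenproblem and compares their span with the least-squares solution against the target T sketched above.

    import numpy as np
    from scipy.linalg import eigh, sqrtm

    rng = np.random.default_rng(0)
    n, d, k = 50, 200, 3              # samples, features (d > n), label dimensions

    # Centered high-dimensional data and a centered multilabel indicator matrix.
    X = rng.standard_normal((d, n))
    X -= X.mean(axis=1, keepdims=True)
    Y = (rng.random((k, n)) > 0.5).astype(float)
    Y -= Y.mean(axis=1, keepdims=True)

    # CCA for the X-side as a generalized symmetric eigenproblem:
    #   X Y^T (Y Y^T)^{-1} Y X^T w = lambda (X X^T) w,
    # with a tiny ridge on X X^T, which is singular when d > n.
    Cxx = X @ X.T + 1e-6 * np.eye(d)
    A = X @ Y.T @ np.linalg.solve(Y @ Y.T, Y @ X.T)
    _, V = eigh(A, Cxx)               # eigenvalues in ascending order
    W_cca = V[:, -k:]                 # top-k canonical directions for X

    # Least-squares formulation: regress X onto the label-derived target
    # T = (Y Y^T)^{-1/2} Y from the sketch above (minimum-norm solution).
    T = np.linalg.solve(sqrtm(Y @ Y.T).real, Y)          # k x n target
    W_ls, *_ = np.linalg.lstsq(X.T, T.T, rcond=None)     # d x k

    # With rank(X) = n - 1 (generic for centered data with d > n), the two
    # solutions should span the same subspace, so the cosines of their
    # principal angles should all be close to 1.
    Q1, _ = np.linalg.qr(W_cca)
    Q2, _ = np.linalg.qr(W_ls)
    print(np.linalg.svd(Q1.T @ Q2, compute_uv=False))    # approx. [1. 1. 1.]

The sparse extension mentioned in the abstract would, in this picture, replace the plain least-squares solve with a 1-norm-regularized regression (e.g., a Lasso solver) against the same target.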