Comparison of penalty functions for sparse canonical correlation analysis

Canonical correlation analysis (CCA) is a widely used multivariate method for assessing the association between two sets of variables. However, when the number of variables far exceeds the number of subjects, as in large-scale genomic studies, the traditional CCA method is not appropriate. In addition, when the variables are highly correlated, the sample covariance matrices become unstable or undefined. To overcome these two issues, sparse canonical correlation analysis (SCCA) for multiple data sets has been proposed using a Lasso type of penalty. However, these methods do not have direct control over the sparsity of the solution. An additional step that uses the Bayesian Information Criterion (BIC) has also been suggested to further filter out unimportant features. In this paper, a comparison of four penalty functions (Lasso, Elastic-net, SCAD and Hard-threshold) for SCCA, with and without the BIC filtering step, has been carried out using both real and simulated genotypic and mRNA expression data. This study indicates that the SCAD penalty with the BIC filter would be a preferable penalty function for application of SCCA to genomic data.
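
To make the difference between the four penalties concrete, the sketch below applies the standard univariate thresholding rule associated with each one to a single coefficient. It is an illustration under stated assumptions, not the paper's implementation: the function names, the parameters `lam`, `lam1`, `lam2`, the "naive" elastic-net rescaling, and the SCAD default `a = 3.7` are textbook forms that the abstract does not specify.

```python
import math


def soft_threshold(z: float, lam: float) -> float:
    """Lasso-type rule: shrink z toward zero by lam; small values become zero."""
    return math.copysign(max(abs(z) - lam, 0.0), z)


def elastic_net_threshold(z: float, lam1: float, lam2: float) -> float:
    """Naive elastic-net rule (assumed form): soft-threshold, then ridge rescaling."""
    return soft_threshold(z, lam1) / (1.0 + lam2)


def scad_threshold(z: float, lam: float, a: float = 3.7) -> float:
    """SCAD rule (needs a > 2): soft near zero, no shrinkage for large |z|."""
    if abs(z) <= 2.0 * lam:
        return soft_threshold(z, lam)
    if abs(z) <= a * lam:
        # Linear interpolation between the soft rule and the identity.
        return ((a - 1.0) * z - math.copysign(a * lam, z)) / (a - 2.0)
    return z


def hard_threshold(z: float, lam: float) -> float:
    """Hard-threshold rule: keep z unchanged if |z| exceeds lam, else zero."""
    return z if abs(z) > lam else 0.0


# Compare the rules on a few hypothetical coefficient values.
for z in (0.5, 1.5, 3.0, 5.0):
    print(
        z,
        soft_threshold(z, 1.0),
        elastic_net_threshold(z, 1.0, 1.0),
        scad_threshold(z, 1.0),
        hard_threshold(z, 1.0),
    )
```

All four rules set small coefficients exactly to zero, which is what produces sparsity. They differ in how they treat large coefficients: the soft and elastic-net rules keep shrinking them, while the SCAD and hard-threshold rules leave them unshrunk, with SCAD remaining continuous in `z`.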
