Correlating multiple SNPs and multiple disease phenotypes: penalized non-linear canonical correlation analysis

MOTIVATION: Canonical correlation analysis (CCA) can be used to capture the underlying genetic background of a complex disease by associating two datasets that contain patients' phenotypic and genetic information. Genetic information is often measured on a qualitative scale, so ordinary CCA cannot be applied to such data. Moreover, genetic studies can produce enormous datasets, which makes the results difficult to interpret.

RESULTS: We developed a penalized non-linear CCA approach that handles qualitative data by transforming each qualitative variable into a continuous variable through optimal scaling. Sparse results are obtained by adapting soft-thresholding to this non-linear version of CCA. In simulation studies, we show that the method is capable of extracting relevant variables from high-dimensional datasets. We applied the method to a genetic dataset of 144 patients with glial cancer.

CONTACT: s.waaijenborg@amc.uva.nl
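
The two building blocks named in the abstract, optimal scaling and soft-thresholding, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `soft_threshold` and `optimal_scale`, the penalty value, and the toy genotype labels are all hypothetical, and the scaling step shown is the textbook nominal-variable rule (each category is replaced by the mean of a target variate within that category), which may differ in detail from the paper's algorithm.

```python
import numpy as np


def soft_threshold(weights, penalty):
    """Shrink each weight toward zero by `penalty`.

    Weights whose absolute value is at most `penalty` become zero,
    which is what produces a sparse weight vector.
    """
    weights = np.asarray(weights, dtype=float)
    return np.sign(weights) * np.maximum(np.abs(weights) - penalty, 0.0)


def optimal_scale(categories, target):
    """Quantify a qualitative variable against a continuous target.

    Each category label is replaced by the mean of `target` over the
    observations in that category; the result is then standardized.
    (Assumed nominal-scaling rule, used here for illustration only.)
    """
    categories = np.asarray(categories)
    target = np.asarray(target, dtype=float)
    scaled = np.empty_like(target)
    for label in np.unique(categories):
        mask = categories == label
        scaled[mask] = target[mask].mean()
    centered = scaled - scaled.mean()
    spread = centered.std()
    # Guard against a variable whose categories all map to the same value.
    return centered / spread if spread > 0 else centered


# Toy data (hypothetical): one SNP coded by genotype label, and a
# continuous variate standing in for the phenotype-side canonical variate.
genotypes = ["AA", "AB", "BB", "AA", "AB", "BB"]
variate = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]

quantified = optimal_scale(genotypes, variate)
sparse_weights = soft_threshold([3.0, -0.5, -2.0], penalty=1.0)
```

In a full penalized non-linear CCA these two steps would alternate inside an iterative loop that also updates the canonical variates; the sketch only shows each step in isolation.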
