Principal component analysis of binary genomics data

Motivation: Genome‐wide measurements of genetic and epigenetic alterations are generating ever more high‐dimensional binary data. Because of the special mathematical characteristics of binary data, it is not obvious how to apply the classical principal component analysis (PCA) model directly to explore low‐dimensional structures. Although several PCA alternatives for binary data exist in the psychometric, data analysis and machine learning literature, they are not well known to the bioinformatics community.

Results: In this article, we introduce the motivation and rationale of some parametric and nonparametric versions of PCA specifically geared to binary data. Using both realistic simulations of binary data and the mutation, copy number aberration (CNA) and methylation data of the Genomic Determinants of Sensitivity in Cancer 1000 (GDSC1000), we explored how well the methods find the correct number of components, how prone they are to overfitting, how well they recover the correct low‐dimensional structure, and how they rank variable importance, among other criteria. The results show that if a low‐dimensional structure exists in the data, most of the methods can find it. When a probabilistic generating process is assumed to underlie the data, we recommend the parametric logistic PCA model; when such an assumption is not valid and the data are considered as given, we recommend the nonparametric Gifi model.

Availability: The code to reproduce the results in this article is available at the homepage of the Biosystems Data Analysis group (www.bdagroup.nl).
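To make the idea of a parametric logistic PCA model concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' code. It assumes the common formulation in which each binary entry is Bernoulli distributed with log-odds given by a column offset plus a rank-k product, and it fits that model with a standard majorization scheme (the logistic loss is bounded above by a quadratic with curvature 1/4, so each step reduces to a truncated SVD). The function names, the simulated data and all parameter values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable logistic function: 1 / (1 + exp(-z)).
    return np.exp(-np.logaddexp(0.0, -z))

def bernoulli_nll(X, theta):
    # Negative Bernoulli log-likelihood with log-odds theta.
    return float(np.sum(np.logaddexp(0.0, theta) - X * theta))

def logistic_pca(X, k, n_iter=50):
    """Fit theta = 1 mu^T + A B^T (rank k) by majorization.

    Hypothetical helper for illustration only; no regularization
    or convergence check is included.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    theta = np.zeros((n, p))                 # feasible start: mu = 0, A B^T = 0
    history = [bernoulli_nll(X, theta)]
    for _ in range(n_iter):
        # Working matrix of the quadratic majorizer (curvature bound 1/4).
        Z = theta + 4.0 * (X - sigmoid(theta))
        mu = Z.mean(axis=0)                  # column offsets
        U, s, Vt = np.linalg.svd(Z - mu, full_matrices=False)
        A = U[:, :k] * s[:k]                 # sample scores
        B = Vt[:k].T                         # variable loadings
        theta = mu + A @ B.T
        history.append(bernoulli_nll(X, theta))
    return mu, A, B, history

# Simulated binary data from a rank-2 log-odds structure (assumed setup).
rng = np.random.default_rng(0)
n, p, k = 200, 30, 2
theta_true = -1.0 + rng.standard_normal((n, k)) @ rng.standard_normal((p, k)).T
X = (rng.random((n, p)) < sigmoid(theta_true)).astype(float)

mu, A, B, history = logistic_pca(X, k)
print("negative log-likelihood: start", history[0], "end", history[-1])
```

Because each update minimizes an upper bound that touches the loss at the current estimate, the negative log-likelihood cannot increase from one iteration to the next. Without a penalty or an early stop the estimated log-odds can grow without bound on separable data, so a practical analysis would add one of those safeguards.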
