SPARSE LOGISTIC PRINCIPAL COMPONENTS ANALYSIS FOR BINARY DATA

We develop a new principal components analysis (PCA) type dimension reduction method for binary data. Unlike standard PCA, which is defined on the observed data, the proposed PCA is defined on the logit transform of the success probabilities of the binary observations. Sparsity is imposed on the principal component (PC) loading vectors to enhance interpretability and to stabilize extraction of the principal components. Our sparse PCA is formulated as an optimization problem whose criterion function is motivated by the penalized Bernoulli likelihood. A majorization-minimization (MM) algorithm is developed to solve the optimization problem efficiently. The effectiveness of the proposed sparse logistic PCA method is illustrated by a simulation study and an application to a single nucleotide polymorphism (SNP) data set.
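To make the formulation concrete, the following is a minimal sketch of one way an MM scheme for an L1-penalized Bernoulli likelihood could look. It is an illustration under stated assumptions, not the authors' algorithm: the function names (`sparse_logistic_pca`, `soft_threshold`, `objective`), the penalty weight `lam`, the uniform curvature bound of 1/4 used for the quadratic majorizer, the orthonormality constraint on the scores, and the SVD-based initialization are all choices made here for illustration.

```python
import numpy as np

def sigmoid(t):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def soft_threshold(c, t):
    # Elementwise soft-thresholding, the proximal map of the L1 penalty.
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def objective(X, mu, A, B, lam):
    # Negative Bernoulli log-likelihood on the logit scale plus L1 penalty.
    theta = mu + A @ B.T            # logits: theta_ij = mu_j + a_i' b_j
    q = 2.0 * X - 1.0               # recode {0, 1} as {-1, +1}
    return np.sum(np.logaddexp(0.0, -q * theta)) + lam * np.sum(np.abs(B))

def sparse_logistic_pca(X, k=2, lam=1.0, n_iter=50):
    """Hypothetical MM sketch: X is an n-by-d binary matrix, k the rank."""
    n, d = X.shape
    mu = np.zeros(d)
    # Assumed initialization: leading left singular vectors of centered X.
    U0, _, _ = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    A = U0[:, :k]                   # scores with orthonormal columns
    B = np.zeros((d, k))            # sparse loadings
    history = [objective(X, mu, A, B, lam)]
    for _ in range(n_iter):
        theta = mu + A @ B.T
        # Majorize: the logistic loss has curvature <= 1/4, so it is bounded
        # above by (1/8) * (theta - Z)^2 + const with the working response Z.
        Z = theta + 4.0 * (X - sigmoid(theta))
        # Minimize the majorizer block by block.
        mu = np.mean(Z - A @ B.T, axis=0)
        Zc = Z - mu
        # With A'A = I the loading update separates into soft-thresholding.
        B = soft_threshold(Zc.T @ A, 4.0 * lam)
        # Score update under A'A = I is an orthogonal Procrustes problem.
        U, _, Vt = np.linalg.svd(Zc @ B, full_matrices=False)
        A = U @ Vt
        history.append(objective(X, mu, A, B, lam))
    return mu, A, B, history

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = (rng.random((100, 15)) < 0.5).astype(float)   # toy binary data
    mu, A, B, history = sparse_logistic_pca(X, k=2, lam=0.5, n_iter=20)
    print("zero loadings:", int(np.sum(B == 0)), "of", B.size)
```

Because each block update can only lower the quadratic surrogate, the penalized objective recorded in `history` should not increase from one iteration to the next. Larger values of `lam` push more loadings exactly to zero.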
