Principal Component Analysis for Exponential Family Data

This chapter reviews exponential family principal component analysis (ePCA), a family of statistical methods for reducing the dimension of large-scale data that are not real-valued, such as user ratings of items in e-commerce, categorical or count-valued genetic data in bioinformatics, and digital images in computer vision. The ePCA framework extends traditional PCA to modern data sets containing various data types. When applied to high-dimensional data, a sparse version of ePCA further helps overcome the inconsistency of the estimated model and improves interpretability. The chapter discusses model formulations and solution strategies for ePCA and sparse ePCA, together with real-world applications.
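To make the idea concrete, the following is a minimal sketch of one member of the ePCA family, logistic PCA for binary data. It is not the chapter's algorithm: it assumes the common formulation in which the matrix of Bernoulli natural parameters is low rank, `Theta = A @ B.T`, and fits it by plain gradient descent on the negative log-likelihood. The function name `logistic_pca`, the step size, and the simulated data are all illustrative assumptions, and the sparsity penalty of sparse ePCA is omitted.

```python
import numpy as np


def logistic_pca(X, rank=2, steps=300, lr=0.01, seed=0):
    """Illustrative logistic PCA: fit Theta = A @ B.T to a binary matrix X.

    Minimizes the Bernoulli negative log-likelihood
        sum_ij [ log(1 + exp(Theta_ij)) - X_ij * Theta_ij ]
    by gradient descent. Returns the scores A, the loadings B, and the
    loss recorded before each update.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Small random start; an exactly zero start would be a stationary point.
    A = 0.01 * rng.standard_normal((n, rank))
    B = 0.01 * rng.standard_normal((d, rank))
    losses = []
    for _ in range(steps):
        Theta = A @ B.T
        # logaddexp(0, t) computes log(1 + exp(t)) without overflow.
        losses.append(float(np.sum(np.logaddexp(0.0, Theta) - X * Theta)))
        # Gradient of the loss with respect to Theta: sigmoid(Theta) - X.
        G = 1.0 / (1.0 + np.exp(-Theta)) - X
        # Chain rule through Theta = A @ B.T; both factors are updated
        # simultaneously from the same gradient.
        A, B = A - lr * (G @ B), B - lr * (G.T @ A)
    return A, B, losses


if __name__ == "__main__":
    # Simulated binary data with a rank-2 natural-parameter matrix (assumption).
    rng = np.random.default_rng(1)
    U = rng.standard_normal((60, 2))
    V = rng.standard_normal((20, 2))
    probs = 1.0 / (1.0 + np.exp(-2.0 * U @ V.T))
    X = (rng.random((60, 20)) < probs).astype(float)

    A, B, losses = logistic_pca(X, rank=2)
    print("initial loss:", losses[0])
    print("final loss:  ", losses[-1])
```

Replacing the Bernoulli log-partition term `log(1 + exp(theta))` with that of another exponential family distribution (for example `exp(theta)` for Poisson counts) gives the corresponding ePCA variant under the same low-rank assumption.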
