Sparse PCA Asymptotics and Analysis of Tree Data

DAN SHEN: Sparse PCA Asymptotics and Analysis of Tree Data. (Under the direction of J. S. Marron and Haipeng Shen.)

This research covers two major areas. The first one is asymptotic properties of Principal Component Analysis (PCA) and sparse PCA. The second one is the application of functional data analysis to tree structured data objects.

A general asymptotic framework is developed for studying consistency properties of PCA. Assuming the spike population model, the framework considers increasing sample size, increasing dimension (or the number of variables) and increasing spike sizes (the relative size of the population eigenvalues). Our framework includes several previously studied domains of asymptotics as special cases, and for the first time allows one to investigate interesting connections and transitions among the various domains. This unification provides new theoretical insights.

Sparse PCA methods are efficient tools to reduce the dimension (or number of variables) of complex data. Sparse principal components (PCs) can be easier to interpret than conventional PCs, because most loadings are zero. We study the asymptotic properties of these sparse PC directions for scenarios with fixed sample size and increasing dimension (i.e. High Dimension, Low Sample Size (HDLSS)). We find a large set of sparsity assumptions under which sparse PCA is still consistent even when conventional PCA is strongly inconsistent. The consistency of sparse PCA is characterized along with rates of convergence. The boundaries of the consistent region are clarified using an oracle result.

Functional data analysis has been very successful in the analysis of data lying in standard Euclidean space, such as curve data. However, with recent developments in fields such as medical image analysis, more and more non-Euclidean spaces, such as tree-structured data, present great challenges to statistical analysis. Here, we use the Dyck path approach from probability theory to build a bridge between tree space and curve space to exploit the power
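
As a rough illustration of the Dyck path idea mentioned above, the sketch below encodes a rooted, ordered tree as a curve of heights by a depth-first walk: each step down an edge raises the height by one, and each step back up lowers it by one. The dictionary representation of the tree and the function name `dyck_path` are assumptions made for this example; the thesis's own construction is not shown in this excerpt and may differ in detail.

```python
from typing import Dict, List

# Hypothetical tree representation: each node maps to its ordered list of children.
Tree = Dict[str, List[str]]


def dyck_path(tree: Tree, root: str) -> List[int]:
    """Return the heights visited by a depth-first walk around the tree.

    The walk starts at height 0 at the root, adds 1 when it descends an edge,
    and subtracts 1 when it returns along that edge.
    """
    heights = [0]

    def visit(node: str, depth: int) -> None:
        for child in tree.get(node, []):
            heights.append(depth + 1)  # step down to the child
            visit(child, depth + 1)
            heights.append(depth)      # step back up to the parent

    visit(root, 0)
    return heights


# A small tree with 4 edges: r -> a, r -> b, a -> c, a -> d.
example = {"r": ["a", "b"], "a": ["c", "d"]}
path = dyck_path(example, "r")
print(path)
```

A tree with k edges yields a path of 2k + 1 heights that starts and ends at zero and never goes negative, so a sample of trees becomes a sample of curves to which curve-based methods can be applied.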
