Boundary behavior in High Dimension, Low Sample Size asymptotics of PCA

In High Dimension, Low Sample Size (HDLSS) data situations, where the dimension d is much larger than the sample size n, principal component analysis (PCA) plays an important role in statistical analysis. Under which conditions does the sample PCA reflect the population covariance structure well? We answer this question in a relevant asymptotic context where d grows and n is fixed, under a generalized spiked covariance model. Specifically, we assume the largest population eigenvalues to be of the order d^α, where α is less than, equal to, or greater than 1. Earlier results give the conditions for consistency and strong inconsistency of the eigenvectors of the sample covariance matrix. In the boundary case, α = 1, where the sample PC directions are neither consistent nor strongly inconsistent, we show that the eigenvalues and eigenvectors do not degenerate but have limiting distributions. The result smoothly bridges the phase transition represented by the other two cases, and thus gives a spectrum of limits for the sample PCA in the HDLSS asymptotics. While the results hold in a general setting, the limiting distributions under the Gaussian assumption are illustrated in greater detail. In addition, the geometric representation of HDLSS data is extended to give three different representations, which depend on the magnitude of the variances in the first few principal components.
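The three regimes can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes the simplest instance of a spiked model (Gaussian data, known zero mean, a single spike with population eigenvalue d^α and all remaining eigenvalues equal to 1), and the function name `first_pc_angle` and the chosen values of d and n are hypothetical. It reports the angle between the first sample PC direction and the first population PC direction.

```python
import numpy as np


def first_pc_angle(d, n, alpha, rng):
    """Angle in degrees between the first sample and population PC directions.

    Assumed toy model (an illustration, not the paper's general setting):
    Gaussian data with mean zero, first population eigenvalue d**alpha,
    all other eigenvalues 1, population eigenvectors = coordinate axes.
    """
    # Per-coordinate standard deviations: sqrt(d**alpha) for the spike, 1 elsewhere.
    sd = np.ones(d)
    sd[0] = d ** (alpha / 2.0)

    # n x d data matrix; each row is one observation.
    X = rng.standard_normal((n, d)) * sd

    # Since d >> n, work with the n x n dual (Gram) matrix instead of the
    # d x d sample covariance matrix. The scaling by n does not affect eigenvectors.
    G = X @ X.T / n
    _, V = np.linalg.eigh(G)      # eigenvalues returned in ascending order
    v = V[:, -1]                  # eigenvector of the largest eigenvalue

    # Map the dual eigenvector back to a unit vector in R^d.
    u_hat = X.T @ v
    u_hat /= np.linalg.norm(u_hat)

    # The first population PC direction is e_1, so the cosine is |u_hat[0]|.
    cos = min(abs(u_hat[0]), 1.0)
    return float(np.degrees(np.arccos(cos)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n = 2000, 5
    for alpha in (0.2, 1.0, 2.0):
        angles = [first_pc_angle(d, n, alpha, rng) for _ in range(5)]
        print(f"alpha={alpha}: angles (deg) = {np.round(angles, 1)}")
```

Under these assumptions, repeated runs with α well above 1 should give angles close to 0 (consistency), runs with α well below 1 should give angles close to 90 degrees (strong inconsistency), and runs with α = 1 should give angles that vary from sample to sample without collapsing to either extreme, in line with the non-degenerate limit described in the abstract.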
