On the number of principal components: A test of dimensionality based on measurements of similarity between matrices

An important problem in principal component analysis (PCA) is estimating the correct number of components to retain. PCA is most often used to reduce a set of observed variables to a new set of variables of lower dimensionality. The choice of this dimensionality is a crucial step for the interpretation of results and for subsequent analyses: retaining too few components loses information (underestimation), while retaining too many introduces random noise (overestimation). New techniques are proposed to evaluate the dimensionality in PCA. They are based on similarity measurements between matrices, singular value decomposition and permutation procedures. A simulation study is conducted to evaluate the relative merits of the proposed approaches. The results show that one method based on the RV coefficient is very accurate and seems to be more efficient than other existing approaches.
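To make the ingredients named above concrete, the following is a minimal sketch of the RV coefficient, a standard measure of similarity between two tables sharing the same rows, together with a rank-k reconstruction obtained by singular value decomposition. It is not the test proposed in the paper, whose details the abstract does not give. The function names `rv_coefficient` and `rank_k_approximation` and the simulated data are assumptions made for illustration.

```python
import numpy as np


def rv_coefficient(X, Y):
    """RV coefficient between two tables that share the same rows.

    Columns are centered first; the value lies between 0 and 1.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sx = Xc @ Xc.T  # n x n cross-product matrix of the first table
    Sy = Yc @ Yc.T  # n x n cross-product matrix of the second table
    num = np.trace(Sx @ Sy)
    den = np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))
    return num / den


def rank_k_approximation(X, k):
    """Best rank-k least-squares approximation of the column-centered table."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]


if __name__ == "__main__":
    # Hypothetical data: 30 observations, 5 variables (illustration only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 5))
    for k in range(1, 6):
        print(k, rv_coefficient(X, rank_k_approximation(X, k)))
```

In this sketch the similarity between the table and its rank-k reconstruction cannot decrease as k grows, and it reaches 1 when all components are kept. A dimensionality test would additionally need a reference distribution, for instance from permutations, to judge whether each added component contributes more than noise.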
