Selecting the number of factors in principal component analysis by permutation testing—Numerical and practical aspects

Selecting the correct number of factors in principal component analysis (PCA) is a critical step in obtaining a reasonable data model, and the optimal strategy depends strictly on the objective for which PCA is applied. In recent decades, much work has been devoted to methods such as Kaiser's eigenvalue-greater-than-1 rule, Velicer's minimum average partial rule, Cattell's scree test, Bartlett's chi-square test, Horn's parallel analysis, and cross-validation. However, limited attention has been paid to the possibility of assessing the significance of the calculated components via permutation testing. This may be a feasible approach when the focus of the study is on discriminating relevant from nonsystematic sources of variation and/or when the aforementioned methodologies cannot be used (e.g., when the analysed matrices do not fulfil specific properties or statistical assumptions). The main aim of this article is to provide practical insights for an improved understanding of permutation testing: it highlights its pros and cons, mathematically formalises the numerical procedure to follow when applying it to PCA factor selection by describing a novel algorithm developed to this end, and proposes ad hoc solutions for optimising computational time and efficiency.
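To make the general idea of permutation testing for PCA concrete, the following is a minimal Python sketch. It is not the novel algorithm this article describes; it is a generic variant that independently shuffles each column to destroy the correlation between variables, then compares the fraction of variance explained by each observed component with its permuted counterpart. The function names, the autoscaling step, the default of 200 permutations, and the sequential stopping rule are all assumptions made for illustration.

```python
import numpy as np


def explained_variance_ratios(X):
    """Fraction of total variance captured by each principal component.

    Assumption: columns are autoscaled (mean 0, unit variance) first,
    so no column may be constant.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    s = np.linalg.svd(Z, compute_uv=False)  # singular values, descending
    return s**2 / np.sum(s**2)


def permutation_test_pca(X, n_permutations=200, alpha=0.05, seed=0):
    """Hypothetical helper: permutation p-value per component.

    Returns (n_factors, p_values), where n_factors counts the leading
    components whose p-value does not exceed alpha.
    """
    rng = np.random.default_rng(seed)
    observed = explained_variance_ratios(X)
    exceed = np.zeros_like(observed)

    for _ in range(n_permutations):
        # Shuffle every column independently: marginal distributions are
        # kept, but the correlation structure between variables is broken.
        Xp = np.column_stack([rng.permutation(col) for col in X.T])
        exceed += explained_variance_ratios(Xp) >= observed

    # Add-one correction so that a p-value is never exactly zero.
    p_values = (exceed + 1) / (n_permutations + 1)

    # Retain components sequentially until the first non-significant one.
    n_factors = 0
    for p in p_values:
        if p > alpha:
            break
        n_factors += 1
    return n_factors, p_values
```

In this simple form, later components are compared against a null distribution obtained from fully permuted data, which tends to be conservative once the first components have absorbed most of the variance. The computational cost grows linearly with the number of permutations, since each one requires a new singular value decomposition.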
