Variable Neighborhood Search Heuristics for Selecting a Subset of Variables in Principal Component Analysis

The selection of a subset of variables from a pool of candidates is an important problem in several areas of multivariate statistics. Within the context of principal component analysis (PCA), a number of authors have argued that subset selection is crucial for identifying the variables required for correct interpretation of the components. In this paper, we adapt the variable neighborhood search (VNS) paradigm to develop two heuristics for variable selection in PCA. The performance of these heuristics was compared with that of a branch-and-bound algorithm, as well as forward stepwise, backward stepwise, and tabu search heuristics. In the first experiment, which considered candidate pools of 18 to 30 variables, the VNS heuristics matched the optimal subset obtained by the branch-and-bound algorithm more frequently than their competitors did. In the second experiment, which considered candidate pools of 54 to 90 variables, the VNS heuristics provided better solutions than their competitors for a large percentage of the test problems. An application to a real-world data set is provided to demonstrate the importance of variable selection in the context of PCA.
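
To make the VNS idea concrete, the following is a minimal Python sketch of a generic VNS loop for choosing p variables out of n. It is not the pair of heuristics developed in the paper: the abstract does not specify their neighborhoods or objective, so the swap-based neighborhoods, the names `local_search`, `shake`, and `vns`, and the parameters `k_max` and `max_iter` are all assumptions made for illustration. The `score` argument is a placeholder for whatever PCA-based selection criterion is to be maximized.

```python
import random
from typing import Callable, FrozenSet, Tuple

Score = Callable[[FrozenSet[int]], float]  # hypothetical criterion, larger is better


def local_search(subset: FrozenSet[int], n_vars: int, score: Score) -> Tuple[FrozenSet[int], float]:
    """First-improvement interchange: swap one selected variable for one unselected."""
    current, best = set(subset), score(frozenset(subset))
    improved = True
    while improved:
        improved = False
        for out_var in sorted(current):
            for in_var in range(n_vars):
                if in_var in current:
                    continue
                candidate = (current - {out_var}) | {in_var}
                value = score(frozenset(candidate))
                if value > best + 1e-12:
                    current, best = candidate, value
                    improved = True
                    break
            if improved:
                break  # restart the scan from the improved subset
    return frozenset(current), best


def shake(subset: FrozenSet[int], n_vars: int, k: int, rng: random.Random) -> FrozenSet[int]:
    """Assumed k-th neighborhood: exchange k random selected and k random unselected variables."""
    unselected = sorted(set(range(n_vars)) - subset)
    k = min(k, len(subset), len(unselected))
    outs = rng.sample(sorted(subset), k)
    ins = rng.sample(unselected, k)
    return frozenset((set(subset) - set(outs)) | set(ins))


def vns(n_vars: int, p: int, score: Score, k_max: int = 3,
        max_iter: int = 50, seed: int = 0) -> Tuple[FrozenSet[int], float]:
    """Basic VNS: shake in neighborhood k, run local search, recenter on improvement."""
    rng = random.Random(seed)
    start = frozenset(rng.sample(range(n_vars), p))
    incumbent, best = local_search(start, n_vars, score)
    for _ in range(max_iter):
        k = 1
        while k <= k_max:
            candidate, value = local_search(shake(incumbent, n_vars, k, rng), n_vars, score)
            if value > best + 1e-12:
                incumbent, best = candidate, value
                k = 1   # improvement: return to the smallest neighborhood
            else:
                k += 1  # no improvement: widen the neighborhood
    return incumbent, best
```

Any function that maps a candidate subset to a number can be passed as `score`, for example one that measures how well the selected variables reproduce the leading principal components of the full data set. Passing the criterion as an argument keeps the search logic separate from that choice.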
