Exact and approximate algorithms for variable selection in linear discriminant analysis

Variable selection is a venerable problem in multivariate statistics. In the context of discriminant analysis, the goal is to select a subset of variables that accomplishes one of two objectives: (1) the provision of a parsimonious, yet descriptive, representation of group structure, or (2) the ability to correctly allocate new cases to groups. We present an exact (branch-and-bound) algorithm for variable selection in linear discriminant analysis that identifies subsets of variables that minimize Wilks' Λ. An important feature of this algorithm is a variable reordering scheme that greatly reduces computation time. We also present an approximate procedure based on tabu search, which can be implemented for a variety of objective criteria designed for either the descriptive or allocation goals associated with discriminant analysis. The tabu search heuristic is especially useful for maximizing the hit ratio (i.e., the percentage of correctly classified cases). Computational results for the proposed methods are provided for two data sets from the literature.
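
As a rough illustration of the selection criterion, the sketch below computes Wilks' Λ for a subset of variables as the ratio det(W_S) / det(T_S), where W and T are the within-group and total sums-of-squares-and-cross-products matrices restricted to the subset S, and then finds the best subset of a given size by plain enumeration. It is not the branch-and-bound algorithm of the paper: the bounding and variable reordering rules are not reproduced, and the function names (`sscp_matrices`, `wilks_lambda`, `best_subset`) are hypothetical. Λ cannot increase when a variable is added to a subset, which is the kind of monotonicity an exact search can exploit for pruning.

```python
import itertools

import numpy as np


def sscp_matrices(X, y):
    """Return (W, T): within-group and total SSCP matrices for data X and labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    centered = X - X.mean(axis=0)
    T = centered.T @ centered
    W = np.zeros_like(T)
    for group in np.unique(y):
        rows = X[y == group]
        rows = rows - rows.mean(axis=0)  # center on the group mean
        W += rows.T @ rows
    return W, T


def wilks_lambda(W, T, subset):
    """Wilks' Lambda for the variables in `subset`: det(W_S) / det(T_S)."""
    idx = np.ix_(list(subset), list(subset))
    # Log-determinants avoid overflow or underflow for larger subsets.
    _, logdet_w = np.linalg.slogdet(W[idx])
    _, logdet_t = np.linalg.slogdet(T[idx])
    return float(np.exp(logdet_w - logdet_t))


def best_subset(X, y, size):
    """Exhaustively search all subsets of `size` variables; return (lambda, subset)."""
    W, T = sscp_matrices(X, y)
    n_vars = W.shape[0]
    candidates = itertools.combinations(range(n_vars), size)
    return min((wilks_lambda(W, T, s), s) for s in candidates)
```

Calling `best_subset(X, y, 2)` on an n-by-p data matrix `X` with group labels `y` returns the smallest Λ among all two-variable subsets together with the column indices that achieve it. Enumeration is feasible only for small p, which is the motivation for exact pruning schemes and heuristics such as tabu search.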
