Using simulated annealing to optimize the feature selection problem in marketing applications

The feature selection (or model specification) problem is concerned with finding the most influential subset of predictors in predictive modeling from a much larger set of potential predictors, which may contain hundreds of candidates. The problem belongs to the realm of combinatorial optimization, where the objective is to find the subset of variables that optimizes the value of some goodness-of-fit function. Owing to its dimensionality, the feature selection problem is NP-hard. Most of the available predictors are noisy or redundant and add little, if anything, to the predictive power of the model. Using all the predictors in the model often results in severe overfitting and very poor predictions. Constructing a prediction model by evaluating all possible subsets is computationally impractical, while assessing the contribution of each predictor separately is inaccurate because it ignores the intercorrelations among predictors. As a result, no analytic solution to the feature selection problem is available, and one must resort to heuristics. In this paper we employ simulated annealing (SA), one of the leading stochastic search methods, to specify a large-scale linear regression model. The SA results are compared to those of the more common stepwise regression (SWR) approach to model specification. The models are applied to realistic data sets in database marketing. We also use simulated data sets to investigate which data characteristics make the SWR approach perform on par with the supposedly superior SA approach.
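The abstract does not spell out the search procedure, so the following is a minimal sketch of the kind of SA search it describes, under assumed design choices: a single-bit-flip neighborhood over subset indicator vectors, a geometric cooling schedule, and AIC for an OLS fit as the goodness-of-fit criterion. All names (`sa_feature_selection`, `aic`) and parameter values here are illustrative, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def aic(X, y, subset):
    """One common AIC form for an OLS fit on the chosen predictor subset
    (plus intercept): n * log(RSS / n) + 2 * (#parameters)."""
    Xs = np.column_stack([np.ones(len(y)), X[:, subset]])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    n = len(y)
    return n * np.log(rss / n) + 2 * Xs.shape[1]

def sa_feature_selection(X, y, n_iter=5000, t0=1.0, cooling=0.999):
    p = X.shape[1]
    current = rng.random(p) < 0.5            # random initial subset
    if not current.any():                    # ensure at least one predictor
        current[rng.integers(p)] = True
    best = current.copy()
    cur_cost = best_cost = aic(X, y, current)
    t = t0
    for _ in range(n_iter):
        cand = current.copy()
        j = rng.integers(p)
        cand[j] = not cand[j]                # flip one predictor in/out
        if not cand.any():                   # never allow the empty subset
            continue
        cost = aic(X, y, cand)
        # Metropolis rule: always accept improvements; accept worse
        # moves with probability exp(-(cost increase) / temperature).
        if cost < cur_cost or rng.random() < np.exp((cur_cost - cost) / t):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = cand.copy(), cost
        t *= cooling                         # geometric cooling schedule
    return best, best_cost

# Toy demonstration on simulated data (assumed setup): 100 candidate
# predictors, of which only the first 5 are truly informative.
n, p = 500, 100
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0
y = X @ beta_true + rng.standard_normal(n)
subset, score = sa_feature_selection(X, y)
print("selected predictors:", subset.nonzero()[0], "AIC:", round(score, 1))
```

A greedy variant of the same loop (accepting only improving flips) behaves much like forward/backward stepwise selection; the temperature-controlled acceptance of worse moves is what lets SA escape the local optima where SWR can stall.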
