A variable selection method based on Tabu search for logistic regression models

A Tabu search method is proposed and analysed for selecting the variables that are subsequently used in logistic regression models. The aim is to find, from among a set of m variables, a smaller subset that enables efficient classification of cases. Reducing dimensionality has well-known advantages that are summarized in the literature. The specific problem consists of finding, for a small integer value of p, a subset of size p of the original set of variables that yields the greatest percentage of hits in logistic regression. The proposed Tabu search method performs a deep search of the solution space that alternates between a basic phase (using simple moves) and a diversification phase (to explore regions not previously visited). Testing shows that it obtains significantly better results than the stepwise, backward, or forward methods used by classic statistical packages. Some results of applying these methods are presented.
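To make the search scheme concrete, the following is a minimal Python sketch of Tabu search for selecting a fixed-size variable subset, scoring each candidate subset by cross-validated logistic regression accuracy. The swap-move neighbourhood, the tabu tenure, the periodic random restart used as diversification, and the scikit-learn scoring are illustrative assumptions for this sketch, not the authors' exact procedure.

```python
# Minimal sketch: Tabu search for variable subset selection in logistic regression.
# Assumes scikit-learn; move structure and diversification are simplified illustrations.
import random
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def score(X, y, subset):
    """Cross-validated percentage of correct classifications for the given subset."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, subset], y, cv=5).mean()


def tabu_select(X, y, p, iters=20, tabu_tenure=7, diversify_every=10, seed=0):
    rng = random.Random(seed)
    m = X.shape[1]
    current = rng.sample(range(m), p)            # initial random subset of size p
    best, best_score = list(current), score(X, y, current)
    tabu = {}                                    # variable -> iteration until which it is tabu

    for it in range(iters):
        if it > 0 and it % diversify_every == 0:
            # diversification phase: restart from a random subset to reach unvisited regions
            current = rng.sample(range(m), p)

        # basic phase: evaluate swap moves (drop one selected variable, add one unselected)
        candidates = []
        for out_var in current:
            for in_var in set(range(m)) - set(current):
                if tabu.get(in_var, -1) >= it:
                    continue                     # skip variables that are currently tabu
                trial = [v for v in current if v != out_var] + [in_var]
                candidates.append((score(X, y, trial), trial, out_var))
        if not candidates:
            continue

        s, trial, removed = max(candidates, key=lambda c: c[0])
        current = trial
        tabu[removed] = it + tabu_tenure         # forbid re-adding the removed variable for a while
        if s > best_score:
            best, best_score = list(trial), s

    return best, best_score


if __name__ == "__main__":
    data = load_breast_cancer()
    subset, acc = tabu_select(data.data, data.target, p=5)
    print("selected variables:", subset, "cv accuracy: %.3f" % acc)
```

Forbidding recently removed variables for a few iterations is what keeps the basic phase from cycling back to the same subsets, while the periodic restart plays the role of the diversification phase described above.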
