A methodology for comparing classification methods through the assessment of model stability and validity in variable selection

Classification analysis uses features to separate observations into distinct groups for decision-making purposes. This study provides a systematic design for comparing the performance of six classification methods using Monte Carlo simulations and illustrates that the variable selection process is integral to such comparisons, ensuring minimal bias, enhanced stability, and optimized performance. We quantify the variable selection bias and show that, for sufficiently large samples, this bias is minimized so that methods can be compared. We address topics relevant to model building and provide prescriptions for future comparisons so as to build a body of evidence for recommending the use of these methods.
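
The shrinking of variable selection bias with sample size can be illustrated with a small Monte Carlo sketch. This is not the study's design: the function name `selection_bias`, the pure-noise features, the single-feature selection step, and the nearest-class-mean rule are all assumptions chosen to keep the example short. It measures how far the apparent (resubstitution) error rate falls below the error rate on fresh data when the feature is selected on the same sample used to assess the rule.

```python
import random

def selection_bias(n, n_features=20, n_reps=100, seed=0):
    """Average (fresh-sample error - apparent error) over n_reps replications.

    Hypothetical setup: every feature is pure noise, so the true error rate
    of any rule is 0.5 and any apparent gain comes from selection alone.
    """
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(n_reps):
        # Balanced training labels and noise features.
        y = [i % 2 for i in range(n)]
        X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]

        # Variable selection: keep the feature with the largest class-mean gap.
        best_gap, best_j, best_m0, best_m1 = -1.0, 0, 0.0, 0.0
        for j in range(n_features):
            x0 = [X[i][j] for i in range(n) if y[i] == 0]
            x1 = [X[i][j] for i in range(n) if y[i] == 1]
            m0, m1 = sum(x0) / len(x0), sum(x1) / len(x1)
            if abs(m1 - m0) > best_gap:
                best_gap, best_j, best_m0, best_m1 = abs(m1 - m0), j, m0, m1

        def predict(x):
            # Nearest-class-mean rule on the selected feature.
            return 1 if abs(x - best_m1) < abs(x - best_m0) else 0

        # Apparent error: same data used for selection and assessment.
        apparent = sum(predict(X[i][best_j]) != y[i] for i in range(n)) / n

        # Fresh error: new noise draws for the selected feature only.
        fresh = sum(predict(rng.gauss(0, 1)) != (i % 2) for i in range(n)) / n

        total_gap += fresh - apparent
    return total_gap / n_reps
```

Calling `selection_bias(20)` and `selection_bias(400)` and comparing the two values shows the gap narrowing as the sample grows, which is the qualitative behavior described above under these simplified assumptions.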
