On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach

An important component of many data mining projects is finding a good classification algorithm, a process that requires careful thought about experimental design. If not done carefully, comparative studies of classification and other types of algorithms can easily result in statistically invalid conclusions. This is especially true when one is using data mining techniques to analyze very large databases, which inevitably contain some statistically unlikely data. This paper describes several phenomena that can, if ignored, invalidate an experimental comparison. These phenomena and the conclusions that follow apply not only to classification, but to computational experiments in almost any aspect of data mining. The paper also discusses why comparative analysis is more important in evaluating some types of algorithms than others, and provides some suggestions about how to avoid the pitfalls suffered by many experimental studies.
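
One concrete way to see how "statistically unlikely data" can invalidate a comparison is the multiple-comparisons effect: if a study runs many pairwise algorithm comparisons, each at a fixed significance level, the chance that at least one difference looks significant purely by luck grows quickly. The sketch below is illustrative only (the number of comparisons, k = 14, is a hypothetical value, not taken from the paper); it computes the family-wise error rate under independence and shows a Bonferroni-style adjustment of the per-comparison threshold, one standard remedy for this kind of pitfall.

```python
# Illustrative sketch of the multiple-comparisons pitfall in classifier studies.
# Assumption: k independent pairwise comparisons, each tested at level alpha,
# with all null hypotheses true (no real differences between algorithms).

alpha = 0.05   # nominal significance level for a single comparison
k = 14         # hypothetical number of pairwise comparisons in a study

# Probability of at least one spurious "significant" result: 1 - (1 - alpha)^k
family_wise_error = 1 - (1 - alpha) ** k
print(f"Chance of >=1 spurious result over {k} tests: {family_wise_error:.2f}")  # ~0.51

# Bonferroni adjustment: test each comparison at alpha / k to keep the
# family-wise error rate at or below the nominal alpha.
adjusted_alpha = alpha / k
print(f"Per-comparison threshold after Bonferroni: {adjusted_alpha:.4f}")
```

With 14 comparisons, the odds of at least one false positive exceed 50%, which is why a reported "significant win" in a large comparative study needs an adjusted significance threshold (or an equivalent correction) before it can be trusted.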
