Choosing Between Two Learning Algorithms Based on Calibrated Tests

Designing a hypothesis test to determine the better of two machine learning algorithms when only a small data set is available is not a simple task. Many popular tests suffer from low power (5×2 cv [2]) or from high Type I error (Weka's 10×10 cross-validation [11]). Furthermore, many tests show a low level of replicability, so that tests performed by different scientists on the same pair of algorithms, with the same data sets and the same hypothesis test, may still give different results. We show that 5×2 cv, resampling and 10-fold cv suffer from low replicability. The main complication arises from the need to use the data multiple times; as a consequence, the independence assumptions underlying most hypothesis tests are violated. In this paper, we argue that reuse of the same data causes the effective degrees of freedom to be much lower than theoretically expected. We show how to calibrate the effective degrees of freedom empirically for various tests. Some tests cannot be calibrated, which indicates a further flaw in their design. The tests that can be calibrated, however, all show very similar behavior. Moreover, their Type I error is on the mark over a wide range of circumstances, while their power and replicability are considerably higher than those of currently popular hypothesis tests.
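To make the calibration idea concrete, the sketch below shows one way to estimate effective degrees of freedom empirically: collect test statistics under a simulated null hypothesis (e.g., comparing an algorithm against itself), then pick the degrees of freedom whose t critical value makes the empirical Type I error match the nominal significance level. The function name `calibrate_dof`, the grid search over candidate values, and the toy data are illustrative assumptions, not the exact procedure of the paper.

```python
import numpy as np
from scipy import stats


def calibrate_dof(null_statistics, alpha=0.05, dof_grid=None):
    """Choose effective degrees of freedom by matching empirical Type I error to alpha.

    null_statistics: t-like statistics collected under a setting where the null
    hypothesis holds by construction (e.g., an algorithm compared with itself).
    """
    if dof_grid is None:
        dof_grid = np.arange(1, 101)  # candidate degrees of freedom to try
    null_statistics = np.asarray(null_statistics)

    best_dof, best_gap = None, np.inf
    for dof in dof_grid:
        crit = stats.t.ppf(1 - alpha / 2, dof)            # two-sided critical value
        type1 = np.mean(np.abs(null_statistics) > crit)   # empirical rejection rate
        gap = abs(type1 - alpha)
        if gap < best_gap:
            best_dof, best_gap = dof, gap
    return best_dof


# Toy usage: statistics drawn from a t distribution with 5 degrees of freedom,
# standing in for statistics whose nominal degrees of freedom would be much larger.
simulated_null = stats.t.rvs(df=5, size=10_000, random_state=np.random.default_rng(0))
print(calibrate_dof(simulated_null))  # should recover a value near 5
```

Under these assumptions, the calibrated degrees of freedom can then replace the theoretical value when computing critical values for the actual comparison of two algorithms.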