Goodness-of-Fit Measures for Induction Trees

This paper is concerned with the goodness-of-fit of induced decision trees. Specifically, we explore the possibility of measuring goodness-of-fit as is classically done in statistical modeling. We show how Chi-square statistics, and especially the Log-likelihood Ratio statistic that is widely used in the modeling of cross tables, can be adapted for induction trees. The Log-likelihood Ratio is well suited for testing the significance of the difference between two nested trees. In addition, we derive pseudo R² measures from it. We also propose adapted forms of the Akaike (AIC) and Bayesian (BIC) information criteria that prove useful in selecting the model offering the best compromise between fit and complexity.
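To make these quantities concrete, the following is a minimal sketch and not the paper's exact formulation. It assumes a tree is summarized by a class-by-leaf count table (rows are response classes, columns are leaves), measures fit by minus twice the log-likelihood of the classes given the leaves, counts one free probability per leaf and non-reference class as the number of parameters, and uses a McFadden-style ratio as the pseudo R². The function names and the example counts are hypothetical.

```python
import math


def neg2_loglik(table):
    """-2 log-likelihood of the class labels given the leaves.

    table[i][j] is the count of class i in leaf j (assumed layout).
    """
    col_tot = [sum(col) for col in zip(*table)]
    ll = 0.0
    for row in table:
        for nij, nj in zip(row, col_tot):
            if nij > 0:  # empty cells contribute nothing
                ll += nij * math.log(nij / nj)
    return -2.0 * ll


def fit_summary(table):
    """Fit measures of a tree relative to the root (single-leaf) tree."""
    n_classes, n_leaves = len(table), len(table[0])
    n = sum(map(sum, table))
    root = [[sum(row)] for row in table]      # collapse all leaves into one
    dev_root = neg2_loglik(root)
    dev = neg2_loglik(table)
    p = n_leaves * (n_classes - 1)            # assumed parameter count
    return {
        "g2": dev_root - dev,                 # LR statistic vs. the root node
        "df": (n_leaves - 1) * (n_classes - 1),
        "pseudo_r2": 1.0 - dev / dev_root,    # McFadden-style, an assumption
        "aic": dev + 2 * p,
        "bic": dev + p * math.log(n),
    }


def compare_nested(small, large):
    """LR statistic and degrees of freedom for two nested trees.

    `large` must refine `small`, i.e. be obtained by splitting its leaves.
    """
    lr = neg2_loglik(small) - neg2_loglik(large)
    df = (len(large[0]) - len(small[0])) * (len(small) - 1)
    return lr, df


# Hypothetical counts: two classes; the large tree splits the second leaf.
small = [[30, 10],
         [10, 30]]
large = [[30, 8, 2],
         [10, 10, 20]]

print(fit_summary(small))
print(fit_summary(large))
print(compare_nested(small, large))
```

Under the usual asymptotic assumptions, the statistic returned by `compare_nested` would be compared with a Chi-square distribution with `df` degrees of freedom, while the tree with the smaller AIC or BIC value would be preferred.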
