Non-Disjoint Discretization for Naive-Bayes Classifiers

Previous discretization techniques have discretized numeric attributes into disjoint intervals. We argue that this is neither necessary nor appropriate for naive-Bayes classifiers. Our analysis leads to a new discretization method, Non-Disjoint Discretization (NDD). NDD forms overlapping intervals for a numeric attribute, always locating a value toward the middle of an interval so as to obtain a more reliable probability estimate. It also adjusts the number and size of discretized intervals to the number of training instances, seeking an appropriate trade-off between the bias and variance of probability estimation. We justify NDD in theory and test it on a wide cross-section of datasets. Our experimental results suggest that, for naive-Bayes classifiers, NDD works better than alternative discretization approaches.
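To make the interval-formation idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes intervals whose size grows with the number of training values n (here, roughly sqrt(n)) and intervals built from three consecutive "atomic" sub-intervals, so that each value falls in the middle third of its own interval; the helper name ndd_interval and the sqrt(n) sizing rule are illustrative assumptions.

```python
import math
from bisect import bisect_right

def ndd_interval(values, v):
    """Illustrative sketch of NDD interval lookup (not the paper's code).

    Given sorted training values for a numeric attribute and a value v,
    return (lo, hi) bounds of the overlapping interval that contains v
    near its middle. Assumes interval size s ~ sqrt(n), with each
    interval composed of three consecutive atomic sub-intervals.
    """
    n = len(values)
    s = max(1, int(math.sqrt(n)))       # assumed: ~sqrt(n) instances per interval
    atom = max(1, s // 3)               # atomic sub-interval holds ~s/3 instances
    # Cut points separating atomic sub-intervals of `atom` instances each.
    cuts = [values[i] for i in range(atom, n, atom)]
    k = bisect_right(cuts, v)           # index of the atomic sub-interval holding v
    # v's interval spans three consecutive atomic sub-intervals, with v's
    # own sub-interval in the middle (clamped at the attribute's extremes).
    lo_atom = max(0, k - 1)
    hi_atom = min(len(cuts), k + 1)
    lo = cuts[lo_atom - 1] if lo_atom > 0 else float("-inf")
    hi = cuts[hi_atom] if hi_atom < len(cuts) else float("inf")
    return lo, hi
```

Under these assumptions, neighbouring values receive overlapping (non-disjoint) intervals, each containing roughly s training instances for the conditional probability estimate; with 100 training values, for example, each value would be estimated from an interval of about nine neighbouring instances centred on its own sub-interval.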