This study evaluates the merits of six of the most popular discretization methods when applied to a randomly generated dataset whose attributes each conform to one of eight common statistical distributions. The aim is to derive a heuristic for selecting the most appropriate discretization method, given a preliminary analysis or visualization that identifies the statistical distribution of the attribute to be discretized. A further primary focus is the comparative effectiveness of discretization under each data distribution. The discretized data were analyzed by inducing a decision tree classifier (C4.5) on them, and an error measure was used to determine the relative value of discretization. The experiments showed that the choice of discretization method and the level of error inherent in the class attribute have a major impact on the classification errors observed after discretization. More importantly, the overall effectiveness of discretization varies significantly with the shape of the data distribution: distributions that are highly skewed or sharply peaked tend to produce higher classification errors, and the relative superiority of supervised over unsupervised discretization diminishes significantly on such distributions.
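As a minimal illustration of the unsupervised end of the spectrum discussed above (not the paper's exact protocol), the sketch below generates a highly skewed attribute and discretizes it with equal-width and equal-frequency binning; the lognormal parameters and bin count are arbitrary choices for demonstration only.

```python
# Illustrative sketch: unsupervised discretization of a skewed attribute.
# Distribution parameters and bin counts are assumptions, not values from the study.
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # highly skewed attribute

def equal_width(values, bins):
    """Split the observed range into `bins` intervals of equal width."""
    edges = np.linspace(values.min(), values.max(), bins + 1)
    return np.digitize(values, edges[1:-1]), edges

def equal_frequency(values, bins):
    """Place cut points so each interval holds roughly the same number of values."""
    edges = np.quantile(values, np.linspace(0.0, 1.0, bins + 1))
    return np.digitize(values, edges[1:-1]), edges

ew_codes, ew_edges = equal_width(x, bins=5)
ef_codes, ef_edges = equal_frequency(x, bins=5)

# On skewed data, equal-width bins concentrate most values in the first interval,
# while equal-frequency bins adapt their widths to the data density.
print("equal-width counts:    ", np.bincount(ew_codes, minlength=5))
print("equal-frequency counts:", np.bincount(ef_codes, minlength=5))
```

Supervised methods differ in that they use the class labels to choose cut points (for example, entropy-based splitting); the comparison in this study concerns how much that advantage persists across distribution shapes.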