A simple data discretizer

Data discretization is an important step in machine learning, since many classifiers handle discrete attributes more easily than continuous ones. Over the years, several discretization methods, such as Boolean Reasoning, Equal Frequency Binning, and entropy-based splitting, have been proposed, explored, and implemented. In this article, a simple supervised discretization approach is introduced. It is a modified version of the supervised Minimum Information Loss (MIL) algorithm, and its prime goal is to maximize the classification accuracy of a classifier while minimizing the information lost in discretizing continuous attributes. The performance of the suggested approach is compared with that of the original MIL algorithm using the state-of-the-art rule-induction algorithm J48 (the Java implementation of the C4.5 classifier). The empirical results show that the modified approach performs better in several cases than both the original MIL algorithm and the Minimum Description Length Principle (MDLP).
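The abstract describes the approach only at a high level, so as an illustration, here is a minimal sketch of generic supervised, entropy-minimizing binary-split discretization in the spirit of Fayyad and Irani's MDLP [3]. It is not the paper's MIL variant; names such as best_cut, discretize, and the max_depth stopping bound are illustrative choices, not anything defined by the paper:

```python
# A minimal sketch of supervised, entropy-based binary-split discretization,
# in the spirit of MDLP [3]. This is NOT the paper's MIL variant, which is
# only summarized above.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (cut, weighted_entropy) for the boundary that minimizes the
    class entropy of the two resulting intervals, or None if no cut exists."""
    pairs = sorted(zip(values, labels))
    vs = [v for v, _ in pairs]
    ls = [l for _, l in pairs]
    n = len(pairs)
    best = None
    for i in range(1, n):
        if vs[i] == vs[i - 1]:          # only split between distinct values
            continue
        cut = (vs[i] + vs[i - 1]) / 2
        w = (i / n) * entropy(ls[:i]) + ((n - i) / n) * entropy(ls[i:])
        if best is None or w < best[1]:
            best = (cut, w)
    return best

def discretize(values, labels, max_depth=2):
    """Recursively collect cut points; the depth bound stands in for a
    proper stopping criterion (MDLP uses an information-gain test)."""
    if max_depth == 0 or entropy(labels) == 0.0:
        return []
    found = best_cut(values, labels)
    if found is None:
        return []
    cut, _ = found
    left = [(v, l) for v, l in zip(values, labels) if v <= cut]
    right = [(v, l) for v, l in zip(values, labels) if v > cut]
    return (discretize([v for v, _ in left], [l for _, l in left], max_depth - 1)
            + [cut]
            + discretize([v for v, _ in right], [l for _, l in right], max_depth - 1))

if __name__ == "__main__":
    x = [0.5, 1.1, 1.9, 2.4, 3.3, 4.0, 4.8, 5.5]
    y = ["a", "a", "a", "b", "b", "b", "a", "a"]
    print(sorted(discretize(x, y)))     # [2.15, 4.4] -> three bins
```

On this toy data the sketch finds cuts at 2.15 and 4.4, yielding three bins that separate the class runs. A full MDLP implementation would replace the depth bound with the information-gain stopping test, and MIL-style methods additionally score candidate cuts by the information lost in merging values into a bin.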

[1] Tony R. Martinez et al. An Empirical Comparison of Discretization Methods, 1995.

[2] Peter Auer et al. Theory and Applications of Agnostic PAC-Learning with Small Decision Trees, 1995, ICML.

[3] Usama M. Fayyad et al. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning, 1993, IJCAI.

[4] Himika Biswas et al. SPID4.7: Discretization Using Successive Pseudo Deletion at Maximum Information Gain Boundary Points, 2005, SDM.

[5] R. E. Lee et al. Distribution-free multiple comparisons between successive treatments, 1995.

[6] Ron Kohavi et al. Supervised and Unsupervised Discretization of Continuous Features, 1995, ICML.

[7] Peter Clark et al. The CN2 induction algorithm, 2004, Machine Learning.

[8] Huan Liu et al. Feature Selection via Discretization, 1997, IEEE Trans. Knowl. Data Eng.

[9] Chidanand Apté et al. Predicting Equity Returns from Securities Data, 1996, Advances in Knowledge Discovery and Data Mining.

[10] Ron Kohavi et al. Error-Based and Entropy-Based Discretization of Continuous Features, 1996, KDD.

[11] Jason Catlett et al. On Changing Continuous Attributes into Ordered Discrete Attributes, 1991, EWSL.

[12] J. Ross Quinlan. C4.5: Programs for Machine Learning, 1992.

[13] Tapio Elomaa et al. General and Efficient Multisplitting of Numerical Attributes, 1999, Machine Learning.

[14] Andrew K. C. Wong et al. Class-Dependent Discretization for Inductive Learning from Continuous and Mixed-Mode Data, 1995, IEEE Trans. Pattern Anal. Mach. Intell.

[15] Robert C. Holte et al. Very Simple Classification Rules Perform Well on Most Commonly Used Datasets, 1993, Machine Learning.

[16] Janez Demsar et al. Statistical Comparisons of Classifiers over Multiple Data Sets, 2006, J. Mach. Learn. Res.

[17] Jerzy W. Grzymala-Busse et al. Global discretization of continuous attributes as preprocessing for machine learning, 1996, Int. J. Approx. Reason.