Improving Classification Accuracy with Discretization on Datasets Including Continuous Valued Features

This study analyzes the effect of discretization on the classification of datasets that include continuous-valued features. Six UCI datasets containing continuous-valued features are discretized with an entropy-based discretization method. Classification performance on the original features is compared with performance on the discretized features using the k-nearest neighbors, Naive Bayes, C4.5, and CN2 data mining classification algorithms. As a result, the classification accuracies of the six datasets improve by 1.71% to 12.31% on average.

Keywords: data mining classification algorithms, entropy-based discretization method
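As a rough illustration of what an entropy-based cut looks like, the sketch below picks the single threshold on one continuous feature that minimizes the weighted class entropy of the two resulting intervals. It is a minimal sketch and not the study's implementation: the function names `entropy`, `best_cut`, and `discretize` are hypothetical, the toy data is invented for the example, and the recursive splitting and stopping criterion that a full method would need are left out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut_point, weighted_entropy) for the binary split that
    minimizes class entropy, or None if all values are identical."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or score < best[1]:
            best = ((lo + hi) / 2, score)  # cut at the midpoint
    return best


def discretize(values, cut):
    """Map each continuous value to interval 0 (<= cut) or 1 (> cut)."""
    return [0 if v <= cut else 1 for v in values]


# Toy data (assumed for illustration only): two well-separated classes.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
cut, score = best_cut(values, labels)
bins = discretize(values, cut)
```

On this toy data the chosen cut falls at 6.5, the midpoint between the two class clusters, and the discretized feature separates the classes exactly. Real datasets typically need several cut points per feature, obtained by applying the same search recursively to each interval until a stopping rule is met.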
