A greedy algorithm for supervised discretization

We present a greedy algorithm for supervised discretization that uses a metric defined on the space of partitions of a set of objects. The proposed technique is useful for preparing data for classifiers that require nominal attributes. Experimental work on decision trees and naïve Bayes classifiers confirms the efficacy of the proposed algorithm.
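To make the idea concrete, the following is a minimal sketch of greedy, metric-driven discretization, not the authors' implementation. It assumes the Shannon-entropy distance d(P, Q) = H(P|Q) + H(Q|P) between two partitions; the abstract does not say which metric is used. The function names (`cond_entropy`, `partition_distance`, `bin_labels`, `greedy_discretize`) and the stopping rule (stop when no cut point reduces the distance) are illustrative assumptions.

```python
from collections import Counter
from math import log2


def cond_entropy(a, b):
    """Shannon conditional entropy H(A | B) of two equal-length label sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    marg_b = Counter(b)
    return -sum(c / n * log2(c / marg_b[y]) for (_, y), c in joint.items())


def partition_distance(a, b):
    """Assumed metric between the partitions induced by a and b: H(A|B) + H(B|A)."""
    return cond_entropy(a, b) + cond_entropy(b, a)


def bin_labels(values, cuts):
    """Map each value to the index of the interval it falls into."""
    return [sum(v > c for c in cuts) for v in values]


def greedy_discretize(values, classes):
    """Greedily add cut points while the distance to the class partition shrinks."""
    distinct = sorted(set(values))
    # Candidate cuts: midpoints between consecutive distinct attribute values.
    candidates = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    cuts = []
    best = partition_distance(bin_labels(values, cuts), classes)
    while candidates:
        d, c = min(
            (partition_distance(bin_labels(values, cuts + [c]), classes), c)
            for c in candidates
        )
        if d >= best:  # assumed stopping rule: no candidate improves the distance
            break
        best = d
        cuts.append(c)
        candidates.remove(c)
    return sorted(cuts)
```

For example, `greedy_discretize([1, 2, 3, 10, 11, 12], list("aaabbb"))` selects the single cut point 6.5, since the resulting two intervals coincide with the class partition and the distance drops to zero.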
