MODL: A Bayes optimal discretization method for continuous attributes

While real data often comes in a mixed format, with both discrete and continuous attributes, many supervised induction algorithms require discrete data. Efficient discretization of continuous attributes is an important problem that affects the speed, accuracy, and understandability of induction models. In this paper, we propose a new discretization method, MODL, founded on a Bayesian approach. We introduce a space of discretization models and a prior distribution defined on this model space. This results in the definition of a Bayes optimal evaluation criterion for discretizations. We then propose a new super-linear optimization algorithm that finds near-optimal discretizations. Extensive comparative experiments on both real and synthetic data demonstrate the high inductive performance obtained by the new discretization method.
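The abstract's idea of scoring a discretization by a prior over models plus a likelihood can be sketched concretely. The code below is an illustrative reconstruction, not the paper's reference implementation: it assumes the MODL-style cost (in nats) of a discretization into I intervals over n instances and J classes is the sum of a prior on I, a prior on the interval bounds, a per-interval prior on the class distribution, and a per-interval multinomial likelihood term. The `greedy_merge` helper is a deliberately simplified bottom-up search, far cruder than the super-linear algorithm the paper proposes.

```python
import math

def log_comb(n, k):
    # Log of the binomial coefficient C(n, k), via log-gamma for stability.
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def modl_cost(intervals, n, J):
    """MODL-style cost (negative log posterior) of a discretization.

    intervals: list of per-interval class-count lists, e.g. [[3, 1], [0, 4]]
    n: total number of instances; J: number of classes. Lower is better.
    The exact prior terms are an assumption based on the MODL approach.
    """
    I = len(intervals)
    cost = math.log(n)                       # prior on the number of intervals
    cost += log_comb(n + I - 1, I - 1)       # prior on the interval bounds
    for counts in intervals:
        ni = sum(counts)
        cost += log_comb(ni + J - 1, J - 1)  # prior on the class distribution
        # Likelihood term: log(ni! / prod_j nij!)
        cost += math.lgamma(ni + 1) - sum(math.lgamma(c + 1) for c in counts)
    return cost

def greedy_merge(intervals, n, J):
    # Simplified bottom-up search: repeatedly merge the first adjacent pair
    # whose merge decreases the cost, until no merge helps.
    best = modl_cost(intervals, n, J)
    improved = True
    while improved and len(intervals) > 1:
        improved = False
        for i in range(len(intervals) - 1):
            merged = (intervals[:i]
                      + [[a + b for a, b in zip(intervals[i], intervals[i + 1])]]
                      + intervals[i + 2:])
            c = modl_cost(merged, n, J)
            if c < best:
                best, intervals, improved = c, merged, True
                break
    return intervals, best
```

Note how the criterion trades model complexity against fit: two class-pure intervals score better than one mixed interval, while intervals with identical class distributions are merged because the extra bound costs more than it explains.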
