Optimum simultaneous discretization with data grid models in supervised classification: a Bayesian model selection approach

In the domain of data preparation for supervised classification, filter methods for variable ranking are time-efficient. However, their intrinsic univariate limitation prevents them from detecting redundancies or constructive interactions between variables. This paper introduces a new method to automatically, rapidly and reliably extract the classificatory information contained in a pair of input variables. It is based on a simultaneous partitioning of the domains of the two input variables, into intervals in the numerical case and into groups of categories in the categorical case. The resulting input data grid makes it possible to quantify the joint information between the two input variables and the output variable. The best joint partitioning is found by maximizing a Bayesian model selection criterion. Extensive experiments demonstrate the benefits of the approach, notably a significant improvement in accuracy on classification tasks.
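
To make the model selection step concrete, the sketch below evaluates a MODL-style Bayesian cost for a candidate grid over two numerical inputs, in the spirit of the criterion described above. This is a minimal illustration, not the paper's exact criterion: the specific prior terms may differ, and the names (`grid_cost`, `log_binomial`) and the encoding of a grid as sorted lists of interval bounds are assumptions made for the example. Lower cost means a better trade-off between grid complexity and class discrimination, so the best grid is the one that minimizes this value.

```python
import bisect
import math
from collections import Counter


def log_binomial(n, k):
    # log C(n, k), computed with lgamma for numerical stability
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def grid_cost(x1, x2, y, bounds1, bounds2, n_classes):
    """MODL-style Bayesian cost of a candidate bivariate data grid (a sketch).

    x1, x2: numerical input values; y: class labels encoded as 0..n_classes-1.
    bounds1, bounds2: sorted interval upper bounds on each axis; an empty
    list keeps that variable in a single interval.
    """
    n = len(y)
    i1, i2 = len(bounds1) + 1, len(bounds2) + 1  # grid sizes on each axis

    # Count instances of each class falling in each grid cell.
    cell_class = Counter()
    cell_total = Counter()
    for a, b, c in zip(x1, x2, y):
        cell = (bisect.bisect_left(bounds1, a), bisect.bisect_left(bounds2, b))
        cell_class[(cell, c)] += 1
        cell_total[cell] += 1

    # Prior: choice of the number of intervals on each axis (1..n),
    # then of the partition of the instances into intervals on each axis.
    cost = 2.0 * math.log(n)
    cost += log_binomial(n + i1 - 1, i1 - 1)
    cost += log_binomial(n + i2 - 1, i2 - 1)

    # Per-cell prior on the class distribution, plus the data likelihood
    # (the log of a multinomial coefficient per non-empty cell; empty
    # cells contribute zero to both terms).
    for cell, n_cell in cell_total.items():
        cost += log_binomial(n_cell + n_classes - 1, n_classes - 1)
        cost += math.lgamma(n_cell + 1)
        for c in range(n_classes):
            cost -= math.lgamma(cell_class.get((cell, c), 0) + 1)
    return cost
```

Since the space of joint partitions is far too large for exhaustive enumeration, such a cost would in practice be minimized heuristically, for instance by greedy bottom-up merges of adjacent intervals starting from a fine grid, keeping the candidate grid with the lowest cost.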
