Efficient Range Partitioning in Classification Learning

Partitioning of data is the essence of many machine learning and data mining methods. It is particularly important in classification learning tasks, where the aim is to induce rules, decision trees, or network structures that separate instances of different classes. This thesis examines the problem of partitioning ordered value ranges into two or more subsets, optimally with respect to an evaluation function. This task is encountered in the induction of multisplitting decision trees and, in many learning paradigms, as a data preprocessing stage preceding the actual learning phase. The goal of partitioning in preprocessing is to transform the data into a form that better suits the learning algorithm or to decrease the resource demands of the algorithm; the handling of numerical values during learning is often the bottleneck in time consumption.

No polynomial-time algorithm is known for the range partitioning task in the general case, which has led to a number of heuristic approaches. These methods are fast, and some of them produce good, but sub-optimal, partitions in practice. We study ways to make optimal partitioning more feasible in terms of time complexity. The approach taken in this study is to exploit general properties of the evaluation functions to decrease the computational demands. We show that many commonly used evaluation functions obtain their minima on a well-defined subset of all cut point combinations. This subset can be identified in a linear-time preprocessing step. The size of the subset does not depend directly on the size of the dataset but on the diversity of the class distributions along the numerical range. Restricting the class of evaluation functions enables quadratic- or cubic-time evaluation over the preprocessed sequence by dynamic programming. We introduce a pruning technique that lets us speed up the algorithms further. In our tests on a large number of publicly available datasets, the average speed-up from these improvements was over 50% of the running time.

As an application, we consider the induction of multisplitting decision trees. We present a comprehensive experimental comparison between the binary splitting, optimal multisplitting, and heuristic multisplitting strategies using two well-known evaluation functions. We examine ways to postpone the evaluation of seemingly irrelevant attributes to a later stage, in order to further improve the efficiency of tree induction. Our main conclusion from these studies is that generating optimal multisplits during tree induction is feasible. However, the predictive accuracy of decision trees only marginally depends on …
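To make the two key ideas concrete, the following Python sketch illustrates (a) a linear scan that keeps only candidate cut points where the class distribution changes, and (b) a dynamic program that finds an optimal k-interval partition over those candidates. This is an illustrative sketch only, not the thesis's exact algorithms: the function names (boundary_indices, optimal_partition), the simplified boundary-point criterion, and the use of average class entropy as the evaluation function are assumptions made for the example, and the thesis's pruning technique is not reproduced.

```python
from collections import Counter
from math import log2

def boundary_indices(values, labels):
    """Candidate cut points of a numeric range sorted by value.
    A border between two runs of equal values is kept unless both
    runs are pure and of the same class (simplified criterion)."""
    bins = []  # list of (value, Counter of class labels)
    for v, y in zip(values, labels):
        if bins and bins[-1][0] == v:
            bins[-1][1][y] += 1
        else:
            bins.append((v, Counter({y: 1})))
    cuts, pos = [], 0
    for (v1, c1), (v2, c2) in zip(bins, bins[1:]):
        pos += sum(c1.values())          # cut lies between example pos-1 and pos
        same_pure = len(c1) == len(c2) == 1 and set(c1) == set(c2)
        if not same_pure:
            cuts.append(pos)
    return cuts

def entropy(counter):
    n = sum(counter.values())
    return -sum((c / n) * log2(c / n) for c in counter.values() if c)

def optimal_partition(values, labels, k):
    """Optimal k-interval partition of the sorted range, minimizing the
    weighted sum of interval entropies. O(k * B^2) time for B candidates."""
    n = len(values)
    cuts = [0] + boundary_indices(values, labels) + [n]
    B = len(cuts)

    def interval_cost(a, b):
        # weighted class entropy of examples between cuts[a] and cuts[b]
        c = Counter(labels[cuts[a]:cuts[b]])
        return (cuts[b] - cuts[a]) / n * entropy(c)

    INF = float("inf")
    # best[j][b]: minimal cost of splitting the prefix ending at cuts[b]
    # into j intervals; back[j][b] remembers the preceding cut index.
    best = [[INF] * B for _ in range(k + 1)]
    back = [[0] * B for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for b in range(1, B):
            for a in range(b):
                if best[j - 1][a] == INF:
                    continue
                cand = best[j - 1][a] + interval_cost(a, b)
                if cand < best[j][b]:
                    best[j][b], back[j][b] = cand, a
    # Recover the k-1 interior cut positions of the optimal split.
    split, b = [], B - 1
    for j in range(k, 1, -1):
        b = back[j][b]
        split.append(cuts[b])
    return best[k][B - 1], sorted(split)
```

For example, optimal_partition(sorted_values, labels_in_value_order, k=3) returns the minimal average class entropy together with the two interior cut positions. The point of the preprocessing step is that B, the number of candidate cut points, depends on how often the class distribution changes along the range rather than on the number of examples, so the quadratic or cubic dynamic program runs over a typically much shorter sequence.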
