Efficient Range Partitioning in Classification Learning

Partitioning of data is the essence of many machine learning and data mining methods. It is particularly important in classification learning tasks, where the aim is to induce rules, decision trees, or network structures that separate instances of different classes. This thesis examines the problem of partitioning ordered value ranges into two or more subsets, optimally with respect to an evaluation function. This task is encountered in the induction of multisplitting decision trees and, in many learning paradigms, as a data preprocessing stage preceding the actual learning phase. The goal of partitioning in preprocessing is to transform the data into a form that better suits the learning algorithm or to decrease the resource demands of the algorithm; the handling of numerical values during learning is often the bottleneck in time consumption.

No polynomial-time algorithm is known for the range partitioning task in the general case, which has led to a number of heuristic approaches. These methods are fast, and some of them produce good, but sub-optimal, partitions in practice. We study ways to make optimal partitioning more feasible in terms of time complexity. The approach taken in this study is to exploit general properties of the evaluation functions to decrease the computational demands. We show that many commonly used evaluation functions obtain their minima on a well-defined subset of all cut point combinations. This subset can be identified in a linear-time preprocessing step. The size of the subset does not depend directly on the size of the dataset but on the diversity of the class distributions along the numerical range. Restricting the class of evaluation functions enables quadratic- or cubic-time evaluation over the preprocessed sequence by dynamic programming. We introduce a pruning technique that lets us speed up the algorithms further. In our tests on a large number of publicly available datasets, the average speed-up from these improvements was over 50% of the running time.

As an application, we consider the induction of multisplitting decision trees. We present a comprehensive experimental comparison between the binary splitting, optimal multisplitting, and heuristic multisplitting strategies using two well-known evaluation functions. We examine ways to postpone the evaluation of seemingly irrelevant attributes to a later stage, in order to further improve the efficiency of tree induction. Our main conclusion from these studies is that generating optimal multisplits during tree induction is feasible. However, the predictive accuracy of decision trees only marginally depends on …
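To make the two key ideas concrete, the following Python sketch illustrates (a) a linear scan that keeps only candidate cut points where the class distribution changes, and (b) a dynamic program that finds an optimal k-interval partition over those candidates. This is an illustrative sketch only, not the thesis's exact algorithms: the function names (boundary_indices, optimal_partition), the simplified boundary-point criterion, and the use of average class entropy as the evaluation function are assumptions made for the example, and the thesis's pruning technique is not reproduced.

```python
from collections import Counter
from math import log2

def boundary_indices(values, labels):
    """Candidate cut points of a numeric range sorted by value.
    A border between two runs of equal values is kept unless both
    runs are pure and of the same class (simplified criterion)."""
    bins = []  # list of (value, Counter of class labels)
    for v, y in zip(values, labels):
        if bins and bins[-1][0] == v:
            bins[-1][1][y] += 1
        else:
            bins.append((v, Counter({y: 1})))
    cuts, pos = [], 0
    for (v1, c1), (v2, c2) in zip(bins, bins[1:]):
        pos += sum(c1.values())          # cut lies between example pos-1 and pos
        same_pure = len(c1) == len(c2) == 1 and set(c1) == set(c2)
        if not same_pure:
            cuts.append(pos)
    return cuts

def entropy(counter):
    n = sum(counter.values())
    return -sum((c / n) * log2(c / n) for c in counter.values() if c)

def optimal_partition(values, labels, k):
    """Optimal k-interval partition of the sorted range, minimizing the
    weighted sum of interval entropies. O(k * B^2) time for B candidates."""
    n = len(values)
    cuts = [0] + boundary_indices(values, labels) + [n]
    B = len(cuts)

    def interval_cost(a, b):
        # weighted class entropy of examples between cuts[a] and cuts[b]
        c = Counter(labels[cuts[a]:cuts[b]])
        return (cuts[b] - cuts[a]) / n * entropy(c)

    INF = float("inf")
    # best[j][b]: minimal cost of splitting the prefix ending at cuts[b]
    # into j intervals; back[j][b] remembers the preceding cut index.
    best = [[INF] * B for _ in range(k + 1)]
    back = [[0] * B for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for b in range(1, B):
            for a in range(b):
                if best[j - 1][a] == INF:
                    continue
                cand = best[j - 1][a] + interval_cost(a, b)
                if cand < best[j][b]:
                    best[j][b], back[j][b] = cand, a
    # Recover the k-1 interior cut positions of the optimal split.
    split, b = [], B - 1
    for j in range(k, 1, -1):
        b = back[j][b]
        split.append(cuts[b])
    return best[k][B - 1], sorted(split)
```

For example, optimal_partition(sorted_values, labels_in_value_order, k=3) returns the minimal average class entropy together with the two interior cut positions. The point of the preprocessing step is that B, the number of candidate cut points, depends on how often the class distribution changes along the range rather than on the number of examples, so the quadratic or cubic dynamic program runs over a typically much shorter sequence.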
