Optimal Partitioning for Classification and Regression Trees

An iterative algorithm is presented that finds a locally optimal partition for an arbitrary loss function, in time linear in N for each iteration. The algorithm is a K-means-like clustering algorithm that uses as its distance measure a generalization of Kullback's information divergence. Moreover, it is proven that the globally optimal partition must satisfy a nearest neighbour condition using divergence as the distance measure. These results generalize similar results of L. Breiman et al. (1984) to an arbitrary number of classes or regression variables and to an arbitrary number of bins. Experimental results on a text-to-speech example are provided, and additional applications of the algorithm, including the design of variable combinations, surrogate splits, composite nodes, and decision graphs, are suggested.
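
The K-means-like iteration described in the abstract can be illustrated with a minimal sketch. It assumes the special case of log loss, where the generalized divergence reduces to the ordinary Kullback-Leibler divergence and the centroid of a bin is the weighted average of its members' class distributions. The function names (`kl_divergence`, `centroid`, `partition_categories`) and the input layout are hypothetical, chosen for illustration only, and are not taken from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) in nats, taking 0 * log(0) = 0."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def centroid(dists, weights):
    """Weighted average of class distributions (assumed centroid under log loss)."""
    total = sum(weights)
    n_classes = len(dists[0])
    return [sum(w * d[c] for d, w in zip(dists, weights)) / total
            for c in range(n_classes)]


def partition_categories(dists, weights, init_assign, n_bins, max_iter=100):
    """Alternate centroid and nearest-neighbour steps until assignments stop changing.

    dists[i]       class distribution observed for category i
    weights[i]     weight (e.g. relative frequency) of category i
    init_assign[i] initial bin index of category i
    """
    assign = list(init_assign)
    for _ in range(max_iter):
        # Centroid step: one representative distribution per non-empty bin.
        centroids = []
        for k in range(n_bins):
            members = [i for i, a in enumerate(assign) if a == k]
            if members:
                centroids.append(centroid([dists[i] for i in members],
                                          [weights[i] for i in members]))
            else:
                centroids.append(None)  # an empty bin is skipped below
        # Nearest-neighbour step: move each category to its closest centroid.
        live_bins = [k for k in range(n_bins) if centroids[k] is not None]
        new_assign = [min(live_bins, key=lambda k: kl_divergence(p, centroids[k]))
                      for p in dists]
        if new_assign == assign:
            break  # fixed point reached: a locally optimal partition
        assign = new_assign
    return assign


# Four categories, two classes, two bins; the starting partition is deliberately poor.
dists = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
weights = [1.0, 1.0, 1.0, 1.0]
print(partition_categories(dists, weights, [0, 0, 1, 0], n_bins=2))
```

In this sketch each pass visits every category once and compares it against every bin, so the work per iteration grows linearly with the number of categories. As with K-means, the result depends on the initial assignment and is only locally optimal.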
