Fast growing and interpretable oblique trees via logistic regression models

Classification trees are attractive because their predictions are more transparent than those of most other classifiers. The most widely accepted approaches to tree growth use axis-parallel splits to partition continuous attributes. Because a tree becomes harder to interpret as it grows larger, researchers have sought ways of growing trees with oblique splits, which are better able to partition observations. The focus of this thesis is to grow oblique trees in a fast and deterministic manner and to propose ways of making them more interpretable.

Finding good oblique splits is computationally difficult. Various authors have proposed doing so either by performing stochastic searches or by solving problems whose solutions effectively produce an oblique split at each stage of tree growth. A new approach is proposed that restricts attention to a small but comprehensive set of splits. Empirical evidence shows that this approach finds good oblique splits in most cases and that, when observations come from a small number of classes, oblique trees can be grown in a matter of seconds.

As interpretability is the main strength of classification trees, it is important that the oblique trees grown are themselves interpretable. Because the proposed approach uses logistic regression to find oblique splits, well-founded variable selection techniques can be introduced to classification trees. This allows concise oblique splits to be found at each stage of tree growth, so that more interpretable oblique trees can be grown directly. In addition, cost-complexity pruning ideas developed for axis-parallel trees are adapted to make oblique trees more interpretable.

A major practical component of this thesis is the oblique.tree package for R, which allows casual users to experiment with oblique trees in a way that was not possible before.
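To make the contrast between axis-parallel and oblique splits concrete, the following minimal sketch fits a two-class logistic regression and uses its decision boundary as a single oblique split. It is an illustration under stated assumptions, not the thesis's algorithm and not the oblique.tree API: the toy data, the gradient-descent fitter, the Gini criterion, and every function name are hypothetical choices made for this example. The search over a restricted set of splits and the variable selection described above are not implemented.

```python
import math


def fit_logistic(X, y, lr=0.5, epochs=200):
    """Fit two-class logistic regression by batch gradient descent (illustrative only)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def weighted_gini(y, goes_left):
    """Size-weighted Gini impurity of the two children produced by a split."""
    left = [yi for yi, g in zip(y, goes_left) if g]
    right = [yi for yi, g in zip(y, goes_left) if not g]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_axis_parallel_split(X, y):
    """Exhaustive search over splits of the form x[j] <= t."""
    best = (float("inf"), None, None)
    for j in range(len(X[0])):
        values = sorted(set(x[j] for x in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            impurity = weighted_gini(y, [x[j] <= t for x in X])
            if impurity < best[0]:
                best = (impurity, j, t)
    return best


def oblique_split(X, y):
    """One oblique split w.x + b <= 0 taken from a logistic regression fit."""
    w, b = fit_logistic(X, y)
    goes_left = [sum(wj * xj for wj, xj in zip(w, x)) + b <= 0 for x in X]
    return weighted_gini(y, goes_left), w, b


def make_diagonal_data():
    """Toy grid data (an assumption): classes separated by the line x1 + x2 = 0."""
    grid = [i / 4 for i in range(-4, 5)]
    X = [[a, c] for a in grid for c in grid if abs(a + c) >= 0.5]
    y = [1 if a + c > 0 else 0 for a, c in X]
    return X, y


X, y = make_diagonal_data()
axis_impurity, axis_feature, axis_threshold = best_axis_parallel_split(X, y)
oblique_impurity, w, b = oblique_split(X, y)
print("best axis-parallel split impurity:", axis_impurity)
print("oblique split impurity:", oblique_impurity, "weights:", w, "bias:", b)
```

On this toy data the classes are separated by a diagonal line, so no single axis-parallel split yields pure children, whereas the single oblique split does. An axis-parallel tree would need several levels to reach the same partition.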
