LOTUS: An Algorithm for Building Accurate and Comprehensible Logistic Regression Trees

Logistic regression is a powerful technique for fitting models to data with a binary response variable, but the models are difficult to interpret if collinearity, nonlinearity, or interactions are present. Moreover, it is hard to judge model adequacy because there are few diagnostics for choosing variable transformations and no true goodness-of-fit test. To overcome these problems, this article proposes fitting a piecewise (multiple or simple) linear logistic regression model by recursively partitioning the data and fitting a different logistic regression in each partition. This allows nonlinear features of the data to be modeled without requiring variable transformations. The binary tree that results from the partitioning process is pruned to minimize a cross-validation estimate of the predicted deviance, which obviates the need for a formal goodness-of-fit test. The resulting model is especially easy to interpret when a simple linear logistic regression is fitted to each partition, because the tree structure and the set of graphs of the fitted functions in the partitions comprise a complete visual description of the model. Trend-adjusted chi-square tests are used to control bias in variable selection at the intermediate nodes, which protects the integrity of inferences drawn from the tree structure. The method is compared with standard stepwise logistic regression on 30 real datasets, several containing tens to hundreds of thousands of observations. Averaged across the datasets, the results show that the method reduces predicted mean deviance by 9% to 16%. We use an example from the Dutch insurance industry to demonstrate how the method can identify and produce an intelligible profile of prospective customers.
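The core idea of the abstract can be illustrated with a minimal sketch: split the data on a predictor, fit a simple logistic regression in each partition, and compare the total deviance against a single global fit. This is not the LOTUS algorithm itself (which selects variables with trend-adjusted chi-square tests and prunes by cross-validated deviance); the split-point search below is a simplified deviance-based surrogate, and all function names and the toy data are illustrative assumptions.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Fit a simple logistic regression by gradient ascent on the
    log-likelihood. X: (n, p) predictors; intercept is added here."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * Xb.T @ (y - p) / len(y)   # gradient of log-likelihood
    return w

def deviance(w, X, y):
    """Residual deviance (-2 * log-likelihood) of a fitted model."""
    Xb = np.column_stack([np.ones(len(X)), X])
    p = np.clip(1.0 / (1.0 + np.exp(-Xb @ w)), 1e-12, 1 - 1e-12)
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def best_split(x, y, min_node=10):
    """Pick the split point on x minimizing the summed deviance of the
    two child-node logistic fits (a surrogate for LOTUS's selection)."""
    best = None
    for c in np.quantile(x, [0.25, 0.5, 0.75]):
        left, right = x <= c, x > c
        if left.sum() < min_node or right.sum() < min_node:
            continue
        wl = fit_logistic(x[left, None], y[left])
        wr = fit_logistic(x[right, None], y[right])
        d = (deviance(wl, x[left, None], y[left]) +
             deviance(wr, x[right, None], y[right]))
        if best is None or d < best[1]:
            best = (c, d)
    return best

# Toy data whose logit is piecewise linear: the slope flips sign at x = 0,
# so no single linear logistic model can capture the relationship.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 400)
logit = np.where(x <= 0, 3 * x + 1, -3 * x + 1)
y = (rng.uniform(size=400) < 1 / (1 + np.exp(-logit))).astype(float)

c, d_split = best_split(x, y)
w_root = fit_logistic(x[:, None], y)
d_root = deviance(w_root, x[:, None], y)
print(f"split at {c:.2f}: deviance {d_split:.1f} vs single model {d_root:.1f}")
```

On this toy example the partitioned fit yields a markedly lower deviance than the single global model, which is precisely the kind of nonlinear structure the paper's tree-structured approach is designed to expose without manual variable transformations.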
