SECRET: a scalable linear regression tree algorithm

Developing regression models for large datasets that are both accurate and easy to interpret is an important data mining problem. Regression trees with linear models in the leaves satisfy both requirements, but no truly scalable regression tree algorithm has been known thus far. This paper proposes a novel regression tree construction algorithm (SECRET) that produces trees of high quality and scales to very large datasets. At every node, SECRET uses the EM algorithm for Gaussian mixtures to find two clusters in the data and to locally transform the regression problem into a classification problem based on closeness to these clusters. Goodness-of-split measures, such as the Gini gain, can then be used to determine the split variable and the split point, much as in classification tree construction. Scalability is achieved by employing scalable versions of the EM and classification tree construction algorithms. An experimental evaluation on real and artificial data shows that SECRET attains accuracy comparable to other linear regression tree algorithms while taking orders of magnitude less computation time on large datasets.
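To make the node-splitting idea concrete, here is a minimal Python sketch of one split step in the spirit of the abstract. It is an illustration under assumptions, not the authors' implementation: scikit-learn's GaussianMixture stands in for the scalable EM variant the paper relies on, the two Gaussians are fitted in the joint input/output space (an assumed design choice), and the function name `secret_split` is hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def secret_split(X, y):
    """One node-splitting step in the spirit of SECRET: turn the local
    regression problem into a binary classification problem via a
    2-component Gaussian mixture, then pick the split by Gini gain.
    A sketch under assumptions, not the authors' implementation."""
    # Fit two Gaussians in the joint (input, output) space; the two
    # components should track two locally linear regimes of the data.
    Z = np.column_stack([X, y])
    labels = GaussianMixture(n_components=2).fit(Z).predict(Z)

    def gini(counts):
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    n, d = X.shape
    total = np.bincount(labels, minlength=2).astype(float)
    parent = gini(total)
    best_var, best_thresh, best_gain = None, None, 0.0
    # Scan axis-parallel thresholds, exactly as in classification trees.
    for j in range(d):
        order = np.argsort(X[:, j])
        xs, ls = X[order, j], labels[order]
        left = np.zeros(2)
        for i in range(n - 1):
            left[ls[i]] += 1.0
            if xs[i] == xs[i + 1]:
                continue  # no threshold separates equal values
            w = (i + 1) / n
            gain = parent - w * gini(left) - (1 - w) * gini(total - left)
            if gain > best_gain:
                best_var = j
                best_thresh = (xs[i] + xs[i + 1]) / 2
                best_gain = gain
    return best_var, best_thresh, best_gain
```

In a full tree, a step like this would be applied recursively and a linear model fitted in each leaf; the paper's scalability comes from substituting scalable EM and classification tree components for the in-memory stand-ins used here.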
