A general framework for accurate and fast regression by data summarization in random decision trees

Predicting the value of a continuous variable as a function of several independent variables is one of the most important problems in data mining. A very large number of regression methods, both parametric and nonparametric, have been proposed in the past. However, since the list is quite extensive, and many of these models make explicit, strong, yet differing assumptions about the types of problems they apply to and involve many parameters and options, choosing an appropriate regression methodology and then specifying its parameter values is a non-trivial, sometimes frustrating, task for data mining practitioners. Choosing an inappropriate methodology can produce rather disappointing results, which undermines the general utility of data mining software. For example, linear regression methods are straightforward and well understood; however, since the linearity assumption is very strong, their performance suffers on complicated non-linear problems. Kernel-based methods perform quite well, but only if the kernel functions are selected correctly. In this paper, we propose a straightforward approach based on summarizing the training data using an ensemble of random decision trees. It requires very little knowledge from the user, yet is applicable to every type of regression problem that we are currently aware of. We have experimented on a wide range of problems, including those on which parametric methods perform well, a large selection of benchmark datasets for nonparametric regression, and highly non-linear stochastic problems. Our results are either significantly better than or identical to those of many approaches that are known to perform well on these problems.
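The core idea described above can be illustrated with a minimal sketch: grow several trees whose split features and thresholds are chosen purely at random (no split-quality search), summarize the training targets at each leaf by their mean, and predict by averaging the leaf summaries across the ensemble. This is an illustrative reconstruction under assumed details (random threshold drawn from observed values, fixed maximum depth), not the authors' exact algorithm; the function names `build_tree` and `predict_ensemble` are hypothetical.

```python
import random

def build_tree(X, y, depth, rng):
    """Grow one random tree. Internal nodes are (feature, threshold, left, right)
    tuples; a leaf is simply the mean of the training targets that reached it
    (the data 'summary'). Splits are random rather than searched, as assumed."""
    if depth == 0 or len(y) <= 1:
        return sum(y) / len(y)                     # leaf: mean target
    f = rng.randrange(len(X[0]))                   # random feature
    t = rng.choice([row[f] for row in X])          # random threshold from data
    left = [(x, yy) for x, yy in zip(X, y) if x[f] <= t]
    right = [(x, yy) for x, yy in zip(X, y) if x[f] > t]
    if not left or not right:                      # degenerate split: make a leaf
        return sum(y) / len(y)
    return (f, t,
            build_tree([x for x, _ in left], [yy for _, yy in left], depth - 1, rng),
            build_tree([x for x, _ in right], [yy for _, yy in right], depth - 1, rng))

def predict_tree(node, x):
    """Route a query point to a leaf and return that leaf's stored mean."""
    if not isinstance(node, tuple):
        return node
    f, t, left, right = node
    return predict_tree(left if x[f] <= t else right, x)

def predict_ensemble(trees, x):
    """Final prediction: average the leaf summaries over all random trees."""
    return sum(predict_tree(tree, x) for tree in trees) / len(trees)

# Example: fit 10 random trees of depth 4 on a toy two-feature dataset.
rng = random.Random(0)
X = [[float(i), float(i % 3)] for i in range(30)]
y = [float(i) for i in range(30)]
trees = [build_tree(X, y, 4, rng) for _ in range(10)]
prediction = predict_ensemble(trees, [10.0, 1.0])
```

Because no single random tree is reliable on its own, the accuracy comes from averaging many of them; each leaf mean is an unbiased local summary, and the ensemble average smooths out the variance introduced by the random splits.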
