Efficient Distributed Decision Trees for Robust Regression

The availability of massive volumes of data and recent advances in data collection and processing platforms have motivated the development of distributed machine learning algorithms. In numerous real-world applications, large datasets are inevitably noisy and contain outliers, which can dramatically degrade the performance of standard machine learning approaches such as regression trees. To address this, we present a novel distributed regression tree approach that employs robust regression statistics, i.e., statistics that are less sensitive to outliers, for handling large and noisy data. We propose to integrate robust-statistics-based error criteria into regression tree learning, and we develop a data summarization method that improves the efficiency of learning regression trees in the distributed setting. We implemented the proposed approach and baselines on Apache Spark, a popular distributed data processing platform. Extensive experiments on both synthetic and real datasets verify the effectiveness and efficiency of our approach. The data and software related to this paper are available at https://github.com/weilai0980/DRSquare_tree/tree/master/.
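The abstract does not specify which robust error criterion is integrated into the tree. Purely as an illustration (an assumption on our part, not the paper's actual algorithm), the sketch below replaces the usual variance-reduction split criterion with the sum of absolute deviations from the median (an L1 impurity). Because L1 error grows linearly rather than quadratically with an outlier's magnitude, extreme target values influence the split choice less than under a squared-error criterion. The function names `l1_impurity` and `best_split` are hypothetical.

```python
# Illustrative sketch only: a median/L1-based split criterion as one example
# of a robust-statistics error criterion for regression trees. This is an
# assumption for exposition, not necessarily the criterion used in the paper.
import numpy as np

def l1_impurity(y):
    """Sum of absolute deviations from the median (robust node impurity)."""
    if len(y) == 0:
        return 0.0
    return float(np.sum(np.abs(y - np.median(y))))

def best_split(x, y):
    """Pick the threshold on a single feature x minimizing total L1 impurity
    of the two child nodes. Returns (threshold, impurity)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, l1_impurity(y))  # no split beats this => threshold is None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # cannot separate identical feature values
        cost = l1_impurity(y[:i]) + l1_impurity(y[i:])
        if cost < best[1]:
            best = ((x[i - 1] + x[i]) / 2.0, cost)
    return best

# Toy example with one extreme target outlier (500.0):
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 500.0])
threshold, cost = best_split(x, y)
```

In a distributed setting, evaluating such a criterion exactly requires medians over data partitioned across workers; this is where a data summarization step (e.g., per-worker quantile or histogram summaries that are merged at the driver, as in streaming decision-tree work) would keep communication costs low.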