Computing the Split Points for Learning Decision Tree in MapReduce

The explosive growth of Data is bringing more and more challenges and opportunities to data mining. In data mining, learning decision tree is a common method, in which determining split points is the key problem. Existing methods of calculating split points in the distributed setting on large data either (1) cause high communication overhead or (2) are not universal for different levels of skewness of data distribution. In this paper, we study the properties of Gini impurity, which is a measure for determining split points, and design new algorithms for calculating split points in MapReduce. Empirical evaluation demonstrates that our method outperforms existing state-of-the-art techniques on communication cost and universality.

[1]  Fuzhen Zhuang,et al.  Parallel Implementation of Classification Algorithms Based on MapReduce , 2010, RSKT.

[2]  Yael Ben-Haim,et al.  A Streaming Parallel Decision Tree Algorithm , 2010, J. Mach. Learn. Res..

[3]  Carlo Zaniolo,et al.  Early Accurate Results for Advanced Analytics on MapReduce , 2012, Proc. VLDB Endow..

[4]  Rakesh Agrawal,et al.  SPRINT: A Scalable Parallel Classifier for Data Mining , 1996, VLDB.

[5]  Christopher Olston,et al.  Building a HighLevel Dataflow System on top of MapReduce: The Pig Experience , 2009, Proc. VLDB Endow..

[6]  Roberto J. Bayardo,et al.  PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce , 2009, Proc. VLDB Endow..

[7]  Ashraf Aboulnaga,et al.  ReStore: Reusing Results of MapReduce Jobs , 2012, Proc. VLDB Endow..

[8]  Werner Vogels,et al.  Dynamo: amazon's highly available key-value store , 2007, SOSP.

[9]  Qin Zhang,et al.  Optimal tracking of distributed heavy hitters and quantiles , 2009, PODS.

[10]  Johannes Gehrke,et al.  BOAT—optimistic decision tree construction , 1999, SIGMOD '99.

[11]  Yuanyuan Tian,et al.  CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop , 2011, Proc. VLDB Endow..

[12]  Ravi Kumar,et al.  Pig latin: a not-so-foreign language for data processing , 2008, SIGMOD Conference.

[13]  Geoff Hulten,et al.  Mining high-speed data streams , 2000, KDD '00.

[14]  Wei-Yin Loh,et al.  Classification and regression trees , 2011, WIREs Data Mining Knowl. Discov..

[15]  Zhaohui Zheng,et al.  Stochastic gradient boosted distributed decision trees , 2009, CIKM.

[16]  Nitesh V. Chawla,et al.  Decision tree learning on very large data sets , 1998, SMC.

[17]  Pete Wyckoff,et al.  Hive - A Warehousing Solution Over a Map-Reduce Framework , 2009, Proc. VLDB Endow..

[18]  Wilson C. Hsieh,et al.  Bigtable: A Distributed Storage System for Structured Data , 2006, TOCS.

[19]  Lu Wang,et al.  Sampling based algorithms for quantile computation in sensor networks , 2011, SIGMOD '11.

[20]  Jorma Rissanen,et al.  SLIQ: A Fast Scalable Classifier for Data Mining , 1996, EDBT.

[21]  Feifei Li,et al.  Building Wavelet Histograms on Large Data in MapReduce , 2011, Proc. VLDB Endow..

[22]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[23]  Tomasz Imielinski,et al.  An Interval Classifier for Database Mining Applications , 1992, VLDB.

[24]  Hans-Arno Jacobsen,et al.  PNUTS: Yahoo!'s hosted data serving platform , 2008, Proc. VLDB Endow..

[25]  JOHANNES GEHRKE,et al.  RainForest—A Framework for Fast Decision Tree Construction of Large Datasets , 1998, Data Mining and Knowledge Discovery.

[26]  Joseph M. Hellerstein,et al.  Online aggregation and continuous query support in MapReduce , 2010, SIGMOD Conference.

[27]  Ruoming Jin,et al.  Communication and Memory Efficient Parallel Decision Tree Construction , 2003, SDM.