MReC4.5: C4.5 Ensemble Classification with MapReduce

Classification is a significant technique in data mining research and applications. C4.5 is a widely used classification method, and ensemble learning adopts a parallel and distributed computing model for classification. Based on an analysis of the MapReduce computing paradigm and of the ensemble learning process, we find that the parallel and distributed computing model of MapReduce is well suited to implementing ensemble learning. This paper combines the advantages of C4.5, ensemble learning, and the MapReduce computing model, and proposes a new method, MReC4.5, for parallel and distributed ensemble classification. Our experimental results show that increasing the number of nodes benefits the effectiveness of classification modeling, and that serialization operations at the model level make the MReC4.5 classifier “construct once, use anywhere”.
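
The map/reduce division of labor described above can be illustrated with a minimal single-process sketch. It is not the MReC4.5 implementation: the base learner is a one-level decision stump standing in for C4.5, the function names (`train_stump`, `map_phase`, `reduce_phase`, `ensemble_predict`) are hypothetical, majority voting is an assumed combination rule, and Python's `pickle` stands in for whatever model-level serialization the method actually uses.

```python
import pickle
from collections import Counter


def train_stump(rows):
    """Stand-in base learner (NOT C4.5): best single-feature threshold split.

    rows is a list of (feature_tuple, label) pairs.
    """
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    # (errors, feature index, threshold, left label, right label)
    best = (len(rows) + 1, 0, 0.0, majority, majority)
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            if not left or not right:
                continue  # not a real split
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    _, f, t, l_lab, r_lab = best
    return {"feature": f, "threshold": t, "left": l_lab, "right": r_lab}


def stump_predict(stump, x):
    side = "left" if x[stump["feature"]] <= stump["threshold"] else "right"
    return stump[side]


def map_phase(partition):
    """Assumed map step: each node trains one base classifier on its partition."""
    return train_stump(partition)


def reduce_phase(models):
    """Assumed reduce step: gather the base classifiers and serialize the ensemble."""
    return pickle.dumps(list(models))


def ensemble_predict(blob, x):
    """Deserialize the ensemble and classify x by majority vote (assumed rule)."""
    models = pickle.loads(blob)
    votes = Counter(stump_predict(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Toy data split into three partitions, one per simulated node.
partitions = [
    [((1.0,), "neg"), ((2.0,), "neg"), ((8.0,), "pos"), ((9.0,), "pos")],
    [((0.5,), "neg"), ((3.0,), "neg"), ((7.0,), "pos")],
    [((4.0,), "neg"), ((6.0,), "pos"), ((10.0,), "pos")],
]
blob = reduce_phase(map_phase(p) for p in partitions)
print(ensemble_predict(blob, (9.5,)))
```

Because the reduce step emits a serialized byte string, the ensemble can be stored and reloaded by any process that has the same prediction code, which is one plausible reading of “construct once, use anywhere”. In a real MapReduce deployment the map calls would run on separate nodes rather than in a generator expression.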
