Meta-learning for large scale machine learning with MapReduce

We have entered the age of big data, and extracting knowledge from massive datasets is becoming increasingly rewarding and urgent. MapReduce provides a practical framework for expressing machine learning algorithms as Map and Reduce functions, and its relatively simple programming interface has helped address the scalability problems of machine learning algorithms. However, the framework has an obvious weakness: it does not support iteration, which makes it hard for iterative algorithms to fully exploit the efficiency of MapReduce. In this paper, we propose to apply meta-learning programmed with MapReduce, which avoids parallelizing the underlying machine learning algorithms themselves while still improving their scalability to big datasets. Experiments on Hadoop in fully distributed mode on Amazon EC2 show that our algorithm, PML, reduces the training computational cost significantly as the number of computing nodes increases, while achieving lower error rates than training on a single node. A comparison of PML with a contemporary parallelized AdaBoost algorithm, AdaBoost.PL, shows that PML attains lower error rates.
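
The abstract does not spell out how meta-learning sidesteps MapReduce's lack of iteration, so the following is a minimal single-machine sketch of the general pattern only, under assumed details: "mappers" each train an independent base classifier on a disjoint data partition in one pass, and a "reducer" combines their predictions (here by simple majority vote). The function names map_phase and reduce_phase, the choice of decision trees as base learners, the partition count, and the voting rule are illustrative assumptions, not the authors' PML implementation or its Hadoop code.

```python
# Sketch of the meta-learning-over-partitions pattern (assumed details, not PML itself):
# each partition yields one base model in a single non-iterative pass ("map"),
# and predictions are merged by majority vote ("reduce").
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def map_phase(partitions):
    """Each 'mapper' trains one base classifier on its own data partition."""
    return [DecisionTreeClassifier(max_depth=5).fit(X_part, y_part)
            for X_part, y_part in partitions]


def reduce_phase(base_models, X):
    """The 'reducer' combines the base classifiers' predictions by majority vote."""
    votes = np.stack([m.predict(X) for m in base_models])  # shape (k, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)


# Synthetic data standing in for a large dataset split across k computing nodes.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

k = 8  # number of partitions / nodes (hypothetical choice)
partitions = list(zip(np.array_split(X_train, k), np.array_split(y_train, k)))

models = map_phase(partitions)
pred = reduce_phase(models, X_test)
print("ensemble error rate:", float(np.mean(pred != y_test)))
```

Because every base model is trained in a single pass over its partition, the whole scheme fits one map step and one reduce step, which is the property the abstract relies on to avoid MapReduce's missing iteration support.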
