Assessing MapReduce for Internet Computing: A Comparison of Hadoop and BitDew-MapReduce

MapReduce is emerging as an important programming model for data-intensive application. Adapting this model to desktop grid would allow taking advantage of the vast amount of computing power and distributed storage to execute new range of application able to process enormous amount of data. In 2010, we have presented the first implementation of MapReduce dedicated to Internet Desktop Grid based on the BitDew middleware. In this paper, we present new optimizations to BitDew-MapReduce (BitDew-MR): aggressive task backup, intermediate result backup, task re-execution mitigation and network failure hiding. We propose a new experimental framework which emulates key fundamental aspects of Internet Desktop Grid. Using the framework, we compare BitDew-MR and the open-source Hadoop middleware on Grid5000. Our experimental results show that 1) BitDew-MR successfully passes all the stress-tests of the framework while Hadoop is unable to work in typical wide-area network topology which includes PC hidden behind firewall and NAT; 2) BitDew-MR outperforms Hadoop performances on several aspects: scalability, fairness, resilience to node failures, and network disconnections.

[1]  Beng Chin Ooi,et al.  The performance of MapReduce , 2010, Proc. VLDB Endow..

[2]  Hai Jin,et al.  Maestro: Replica-Aware Map Scheduling for MapReduce , 2012, 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012).

[3]  Gilles Fedak,et al.  BitDew: A programmable environment for large-scale data management and distribution , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[4]  Wu-chun Feng,et al.  MOON: MapReduce On Opportunistic eNvironments , 2010, HPDC '10.

[5]  Douglas Thain,et al.  Distributed computing in practice: the Condor experience , 2005, Concurr. Pract. Exp..

[6]  Joseph M. Hellerstein,et al.  MapReduce Online , 2010, NSDI.

[7]  Andrew A. Chien,et al.  Henri Casanova , 2022 .

[8]  David P. Anderson,et al.  EmBOINC: An emulator for performance analysis of BOINC projects , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[9]  Rajkumar Buyya,et al.  GridSim: a toolkit for the modeling and simulation of distributed resource management and scheduling for Grid computing , 2002, Concurr. Comput. Pract. Exp..

[10]  Randy H. Katz,et al.  Improving MapReduce Performance in Heterogeneous Environments , 2008, OSDI.

[11]  Hai Jin,et al.  Adaptive Disk I/O Scheduling for MapReduce in Virtualized Environment , 2011, 2011 International Conference on Parallel Processing.

[12]  Jean-Marc Vincent,et al.  Mining for statistical models of availability in large-scale distributed systems: An empirical study of SETI@home , 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.

[13]  Gilles Fedak,et al.  Distributed Results Checking for MapReduce in Volunteer Computing , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[14]  Indranil Gupta,et al.  Making cloud intermediate data fault-tolerant , 2010, SoCC '10.

[15]  Emmanuel Jeannot,et al.  Wrekavoc: a tool for emulating heterogeneity , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[16]  Henri Casanova,et al.  SimGrid: A Generic Framework for Large-Scale Distributed Experiments , 2008, Tenth International Conference on Computer Modeling and Simulation (uksim 2008).

[17]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[18]  Kurt Stockinger,et al.  OptorSim-A Grid Simulator for Studying Dynamic Data Replication Strategies , 2003 .

[19]  Bingsheng He,et al.  Mars: Accelerating MapReduce with Graphics Processors , 2011, IEEE Transactions on Parallel and Distributed Systems.

[20]  Naga K. Govindaraju,et al.  Mars: A MapReduce Framework on graphics processors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[21]  Gilles Fedak,et al.  Towards MapReduce for Desktop Grid Computing , 2010, 2010 International Conference on P2P, Parallel, Grid, Cloud and Internet Computing.

[22]  Christoforos E. Kozyrakis,et al.  Evaluating MapReduce for Multi-core and Multiprocessor Systems , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[23]  Thomas Hérault,et al.  Computing on large-scale distributed systems: XtremWeb architecture, programming models, security, tests and convergence with grid , 2005, Future Gener. Comput. Syst..

[24]  Gilles Fedak,et al.  The Computational and Storage Potential of Volunteer Computing , 2006, Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06).

[25]  GhemawatSanjay,et al.  The Google file system , 2003 .

[26]  Richard Wolski,et al.  Modeling Machine Availability in Enterprise and Wide-Area Distributed Computing Environments , 2005, Euro-Par.

[27]  David P. Anderson,et al.  BOINC: a system for public-resource computing and storage , 2004, Fifth IEEE/ACM International Workshop on Grid Computing.

[28]  Copyright © Intel Corporation 2008 * Other names and brands may be claimed as the property of others , 2004 .