BIGhybrid: a simulator for MapReduce applications in hybrid distributed infrastructures validated with the Grid5000 experimental platform

Cloud computing is increasingly used as a platform for running large business and data processing applications. Desktop Grids, in turn, have been successfully employed in a wide range of projects because they can exploit a large number of resources provided free of charge by volunteers. A hybrid infrastructure that combines Cloud and Desktop Grid resources can provide a low-cost and scalable solution for Big Data analysis. Although frameworks such as MapReduce were designed to exploit commodity hardware, running them on a hybrid infrastructure poses significant challenges because of the large resource heterogeneity and high churn rate of such infrastructures. This paper proposes BIGhybrid, a simulator for two existing classes of MapReduce runtime environments: BitDew-MapReduce, designed for Desktop Grids, and BlobSeer-Hadoop, designed for Cloud computing. The goal is to carry out accurate simulations of MapReduce executions in a hybrid infrastructure composed of Cloud computing and Desktop Grid resources. This work describes the principles of the simulator and its validation with the Grid5000 experimental platform. With BIGhybrid, developers can investigate and evaluate new algorithms that enable MapReduce to be executed in hybrid infrastructures, covering topics such as resource allocation and data splitting. Copyright © 2015 John Wiley & Sons, Ltd.
