BigDataNetSim: A Simulator for Data and Process Placement in Large Big Data Platforms

Big Data platforms are complex distributed systems, and developing solutions on them is commonly skill- and labour-intensive because of the challenges inherent in Big Data applications. Several tools have been proposed to help developers and engineers manage the complexity of coordinating the execution of many processes and threads across multiple machines. However, no work so far has combined an accurate representation of Big Data jobs with realistic modeling of the behaviour of Big Data platforms at scale, including networking elements and data and job placement. In this paper, we propose BigDataNetSim, the first simulator that accurately models all the main components of data movement in Big Data platforms (e.g., HDFS, YARN/MapReduce, network topologies, switching/routing protocols) in a large-scale system. BigDataNetSim can serve as a valuable tool for engineering Big Data solutions, including system set-up, job prototyping, and the improvement of components and algorithms for Big Data platforms. We also demonstrate that BigDataNetSim can simulate a real Hadoop cluster with a high degree of accuracy in terms of data and job placement, and that it scales to very large systems.
