Simulating Spark cluster for deployment planning, evaluation and optimization

As the most active project in the Hadoop ecosystem these days (Zaharia, 2014), Spark is a fast and general-purpose engine for large-scale data processing. Thanks to its advanced Directed Acyclic Graph (DAG) execution engine and in-memory computing mechanism, Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk (Apache, 2016). However, Spark performance is affected by many system software, hardware and dataset factors, especially memory- and JVM-related ones, which makes capacity planning and tuning for Spark clusters extremely difficult. Current planning methods are mostly estimation based and depend heavily on experience and trial and error. These approaches are far from efficient and accurate, especially with increasing software stack complexity and hardware diversity. Here, we propose a novel Spark simulator based on CSMethod (Bian et al., 2014), extended with a fine-grained, multi-layered memory subsystem and well suited for Spark cluster deployment planning, performance evaluation and optimization before system provisioning. The proposed simulator simulates the whole Spark application execution life cycle, including DAG generation, Resilient Distributed Dataset (RDD) processing and block management. Hardware activities derived from these software operations are dynamically mapped onto architecture models for processors, storage, and network devices. The performance behaviour of the cluster memory system at multiple layers (Spark, JVM, OS, hardware) is modeled as an enhanced, fine-grained, individual global library. Experimental results with several popular Spark micro-benchmarks and a real-case IoT workload demonstrate that our Spark simulator achieves high accuracy, with an average error rate below 7%. With lightweight computing resource requirements (a laptop is enough), our simulator runs at the same speed level as native execution on a multi-node high-end cluster.

[1] Luciana Arantes, et al. MRSG - A MapReduce simulator over SimGrid, 2013, Parallel Comput.

[2] Scott Shenker, et al. Shark: SQL and rich analytics at scale, 2012, SIGMOD '13.

[3] Rolf Riesen, et al. Instruction-level simulation of a cluster at scale, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[4] Guanying Wang, et al. A simulation approach to evaluating design decisions in MapReduce setups, 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.

[5] Wei Zhou, et al. Simulating Big Data Clusters for System Planning, Evaluation, and Optimization, 2014, 2014 43rd International Conference on Parallel Processing.

[6] Scott Shenker, et al. Spark: Cluster Computing with Working Sets, 2010, HotCloud.

[7] Fredrik Larsson, et al. Simics: A Full System Simulation Platform, 2002, Computer.

[8] T. V. Gopal, et al. A MR Simulator in Facilitating Cloud Computing, 2013.

[9] Roy H. Campbell, et al. Play It Again, SimMR!, 2011, 2011 IEEE International Conference on Cluster Computing.