lpt: A Tool for Tuning the Level of Parallelism of Spark Applications

Spark is increasingly the platform of choice for big-data analyses, mainly due to its fast, fault-tolerant, in-memory processing model. Despite the popularity and maturity of the Spark framework, tuning Spark applications to achieve high performance remains challenging. In this paper, we present lpt, a novel tool that assists users in improving the level of parallelism of applications running on Spark in local mode. lpt helps users tune the level of parallelism so that an application spawns enough tasks to fully exploit the available computing resources. Our evaluation results show that optimizations guided by lpt can achieve speedups of up to 2.72x.
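
To make the idea of matching the number of tasks to the available cores concrete, the following is a minimal sketch and not lpt's actual method. It assumes a deliberately simplified model in which all tasks take the same time and run in "waves" of at most one task per core; the function names `core_utilization` and `suggest_parallelism` and the default of two tasks per core are hypothetical choices made for illustration.

```python
import math


def core_utilization(num_tasks: int, num_cores: int) -> float:
    """Fraction of core time spent busy under the simplified model.

    Assumption: every task has the same duration and tasks execute in
    waves of at most `num_cores` tasks each.
    """
    if num_tasks < 1 or num_cores < 1:
        raise ValueError("num_tasks and num_cores must be positive")
    waves = math.ceil(num_tasks / num_cores)
    return num_tasks / (waves * num_cores)


def suggest_parallelism(num_cores: int, tasks_per_core: int = 2) -> int:
    """Hypothetical heuristic: give every core a fixed number of tasks.

    The result could be used, for example, as the value of Spark's
    `spark.default.parallelism` property or as the partition count
    passed to `repartition`.
    """
    if num_cores < 1 or tasks_per_core < 1:
        raise ValueError("num_cores and tasks_per_core must be positive")
    return num_cores * tasks_per_core
```

Under this model, 4 tasks on a 16-core machine keep only a quarter of the cores busy, while 32 tasks fill two complete waves. Real tasks differ in duration and carry scheduling overhead, so a profile-guided choice such as the one lpt supports may differ from this heuristic.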
