Workload characterization and optimization of TPC-H queries on Apache Spark

Although Spark is an in-memory-oriented computing framework, it runs on top of Java Virtual Machines (JVMs), so JVM parameters must be tuned to improve Spark application performance. Misconfigured parameters and settings degrade performance; for example, an overly large Java heap often causes long garbage collection pauses, which can account for 10-20% or more of application execution time. Moreover, modern computing nodes have many cores with simultaneous multi-threading (SMT), and the processors within a node are connected in a NUMA topology, so it is difficult to achieve the best performance without taking these hardware features into account. Full-stack optimization is therefore also important: not only JVM parameters but also OS parameters, the Spark configuration, and application code must be tuned to the CPU's characteristics to take full advantage of the underlying computing resources. In this paper, we used the TPC-H benchmark as our optimization case study and gathered logs from multiple perspectives, including application logs, JVM logs (e.g., GC and JIT), system utilization, and hardware events from the performance monitoring unit. We discuss the current problems and introduce several JVM and OS parameter optimizations for accelerating Spark. As a result, our optimizations achieve a 30-40% speedup on average and are up to 5x faster than the naive configuration.
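
To make the tuning dimensions concrete, the following Scala sketch shows how JVM-level options (young-generation size, GC choice, GC logging) and Spark-level parallelism settings might be passed to executors through standard Spark configuration properties. The specific flag values are illustrative assumptions for this sketch, not the tuned settings reported in the paper; OS-level steps such as NUMA binding are typically applied outside Spark (e.g., via numactl) and are only noted in comments.

import org.apache.spark.sql.SparkSession

// Minimal sketch: passing JVM and Spark-level tuning parameters to executors.
// The concrete values below are illustrative assumptions, not the paper's results.
object TpchTuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TPC-H tuning sketch")
      // Keep the executor heap moderate; oversized heaps tend to lengthen GC pauses.
      .config("spark.executor.memory", "24g")
      // Example JVM options: pick a collector, size the young generation, and
      // enable GC logging so pause times can be correlated with query stages.
      // Note: -Xmx must not be set here; spark.executor.memory controls the heap size.
      .config("spark.executor.extraJavaOptions",
        "-XX:+UseParallelGC -Xmn8g -verbose:gc -XX:+PrintGCDetails")
      // Match task slots to the hardware threads available per executor (cores x SMT).
      .config("spark.executor.cores", "8")
      // The shuffle partition count is another knob that interacts with core count.
      .config("spark.sql.shuffle.partitions", "200")
      .getOrCreate()

    // NUMA binding of executor processes would be done at the OS level,
    // e.g. by launching workers under numactl, not through the Spark configuration.

    // TPC-H table registration and query execution would follow here.
    spark.stop()
  }
}

Which collector and which generation sizes actually help depends on the heap size and query mix; the point of the study is that these JVM knobs need to be tuned together with OS settings and hardware-aware Spark parameters rather than left at their defaults.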
