Performance Evaluation of Apache Spark on Cray XC Systems

We report our experiences in porting and tuning the Apache Spark data analytics framework on the Cray XC30 (Edison) and XC40 (Cori) systems installed at NERSC. Spark was designed for cloud environments, where local disk I/O is cheap and performance is constrained by network latency. In large HPC systems, diskless nodes are connected by fast networks, and without careful tuning Spark execution is dominated by I/O performance. In the default configuration, relying on a centralized storage system such as Lustre makes metadata access latency a major bottleneck that severely constrains scalability. We show how to mitigate this by using per-node loopback filesystems for temporary storage. With this technique, we reduce the communication (data shuffle) time by multiple orders of magnitude and improve application scalability from O(100) to O(10,000) cores on Cori. With this configuration, Spark's execution again becomes network dominated, which is reflected in a performance comparison against a cluster with fast local SSDs designed specifically for data-intensive workloads. Owing to its slightly faster processors and better network, Cori outperforms that cluster by an average of 13.7% on the machine learning benchmark suite. This is the first such result in which HPC systems outperform systems designed for data-intensive workloads. Overall, we believe this paper demonstrates that local disks are not necessary for good performance on data analytics workloads.

Keywords: Spark; Berkeley Data Analytics Stack; Cray XC; Lustre; Shifter
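
As a concrete illustration of the shuffle-storage tuning described above, the sketch below shows how a Spark job could be pointed at a per-node loopback filesystem for its scratch space. The mount point /mnt/loop/spark-scratch is hypothetical; the assumption is that each node loop-mounts a preallocated filesystem image at such a path before the job starts, so that shuffle-file metadata operations are handled by the local kernel rather than by the Lustre metadata server.

    // Minimal Scala sketch (hypothetical mount point): direct Spark's
    // shuffle and spill files to a per-node loopback filesystem.
    import org.apache.spark.{SparkConf, SparkContext}

    object LoopbackShuffleExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("LoopbackShuffleExample")
          // spark.local.dir controls where shuffle and spill files are written;
          // /mnt/loop/spark-scratch is assumed to be a node-local loopback mount.
          .set("spark.local.dir", "/mnt/loop/spark-scratch")
        val sc = new SparkContext(conf)

        // A small shuffle-heavy workload: reduceByKey forces a data shuffle,
        // so its intermediate files land in the configured local directory.
        val distinctKeys = sc.parallelize(1 to 1000000)
          .map(i => (i % 1000, 1))
          .reduceByKey(_ + _)
          .count()
        println(s"Distinct keys after shuffle: $distinctKeys")
        sc.stop()
      }
    }

This is only a sketch under the assumption that the loopback filesystem is already mounted on every node; the same property can equivalently be supplied at submission time with --conf spark.local.dir=... without modifying application code.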