Characterizing the Performance of Analytics Workloads on the Cray XC40

This paper investigates the performance characteristics of high performance data analytics (HPDA) workloads on the Cray XC40™, with a focus on commonly used open source analytics frameworks such as Apache Spark. We examine two types of Spark workloads: the Spark benchmarks from the Intel HiBench 4.0 suite and a CX matrix decomposition algorithm. We study performance from both a bottom-up view (via system metrics) and a top-down view (via application log analysis), and show how the two views together can help identify performance bottlenecks and system issues that affect data analytics workload performance. Based on this study, we provide recommendations for improving the performance of analytics workloads on the XC40.

Keywords: Spark; Cray XC40; data analytics; big data