Scaling Spark on HPC Systems

We report our experiences porting Spark to large production HPC systems. While Spark performance in a data center installation (with local disks) is dominated by the network, our results show that file system metadata access latency can dominate in an HPC installation using Lustre: it can make single-node performance up to 4x slower than that of a typical workstation. We evaluate a combination of software techniques and hardware configurations designed to address this problem. On the software side, we develop a file pooling layer that improves per-node performance by up to 2.8x. On the hardware side, we evaluate a system with a large NVRAM buffer between the compute nodes and the backend Lustre file system; this improves scaling at the expense of per-node performance. Overall, our results indicate that scalability is currently limited to O(10^2) cores in an HPC installation with Lustre and default Spark. After careful configuration combined with our file pooling layer, we can scale up to O(10^4) cores. Our analysis indicates that much higher scalability is feasible in the near future.
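To illustrate the idea behind a file pooling layer, the sketch below (in Scala, Spark's implementation language) caches open file handles in an LRU pool so that repeated accesses to shuffle and temporary files do not each pay Lustre's metadata (open/close) latency. This is a minimal illustrative sketch, not the paper's implementation: the object name FilePool, the descriptor cap, and the use of RandomAccessFile are assumptions made here for clarity.

```scala
import java.io.{File, RandomAccessFile}
import scala.collection.mutable

// Illustrative sketch of a file handle pool: keep descriptors open across
// accesses so repeated reads/writes of the same file avoid extra metadata
// round-trips to the parallel file system.
object FilePool {
  private val maxOpen = 1024  // assumed cap on cached descriptors
  // LinkedHashMap preserves insertion order, so the head is the LRU entry.
  private val pool = mutable.LinkedHashMap.empty[String, RandomAccessFile]

  def acquire(path: String): RandomAccessFile = synchronized {
    pool.remove(path) match {
      case Some(handle) =>
        pool.put(path, handle)          // refresh LRU position
        handle
      case None =>
        if (pool.size >= maxOpen) {     // evict the least recently used handle
          val (oldPath, oldHandle) = pool.head
          pool.remove(oldPath)
          oldHandle.close()
        }
        val handle = new RandomAccessFile(new File(path), "rw")
        pool.put(path, handle)
        handle
    }
  }

  // Descriptors stay cached until evicted; call this only at shutdown.
  def releaseAll(): Unit = synchronized {
    pool.values.foreach(_.close())
    pool.clear()
  }
}
```

The design choice sketched here is to trade a bounded number of open file descriptors per node for fewer metadata operations, which matters on Lustre because open/close costs are paid at the metadata server rather than on a local disk.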
