The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems

CORAL, the Collaboration of Oak Ridge, Argonne and Livermore, is fielding two similar IBM systems with NVIDIA GPUs, Summit and Sierra, which will replace the existing Titan and Sequoia systems. Summit and Sierra are currently ranked No. 1 and No. 3, respectively, on the Top500 list. We discuss the design and key differences of the two systems. Our evaluation highlights the following. Applications that fit in HBM see the most benefit and may prefer more GPUs; for some applications, however, CPU-GPU bandwidth matters more than the number of GPUs. The node-local burst buffer scales linearly and can achieve a 4X improvement over the parallel file system (PFS) for large jobs; smaller jobs, however, may benefit from writing directly to the PFS. Finally, several CPU-, network-, and memory-bound analytics codes and GPU-bound deep learning codes achieve speedups of up to 11X and 79X per node, respectively, over Titan.

[1]  Surendra Byna,et al.  Accelerating Science with the NERSC Burst Buffer Early User Program , 2016 .

[2]  Bei Wang,et al.  Performance Portability of HPC Discovery Science Software: Fusion Energy Turbulence Simulations at Extreme Scale , 2017, Supercomput. Front. Innov..

[3]  Devesh Tiwari,et al.  GUIDE: A Scalable Information Directory Service to Collect, Federate, and Analyze Logs for Operational Insights into a Leadership HPC Facility , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[4]  Robert B. Ross,et al.  On the role of burst buffers in leadership-class storage systems , 2012, 012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST).

[5]  G. Hagen,et al.  Structure of ^{78}Ni from First-Principles Computations. , 2016, Physical review letters.

[6]  David Appelhans,et al.  Leveraging NVLINK and asynchronous data transfer to scale beyond the memory capacity of GPUs , 2017, ScalA@SC.

[7]  Richard M. Russell,et al.  The CRAY-1 computer system , 1978, CACM.

[8]  Bronis R. de Supinski,et al.  Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[9]  Paul B. Schneck The CDC STAR-100 , 1987 .

[10]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Ibm Power Npu team Functionality and performance of NVLink with IBM POWER9 processors , 2018, IBM J. Res. Dev..

[12]  김종영 구글 TensorFlow 소개 , 2015 .

[13]  A Moody Contention-free Routing for Shift-based Communication in MPI Applications on Large-scale Infiniband Clusters , 2009 .

[14]  W R Smith The cray-1. , 1982, Science.

[15]  Kevin Harms,et al.  Impact of Burst Buffer Architectures on Application Portability , 2016 .

[16]  John Bent,et al.  PLFS: a checkpoint filesystem for parallel applications , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.