Lattice-CSC: Optimizing and Building an Efficient Supercomputer for Lattice QCD to Achieve First Place on the Green500

In recent decades, supercomputers have become a necessity in science and industry. Huge data centers consume enormous amounts of electricity, and we have reached a point where newer, faster computers can no longer simply draw more power than their predecessors. Since user demand for compute capability has not declined, the feasibility of exaflop systems has become the subject of dedicated studies. Heterogeneous clusters with highly efficient accelerators such as GPUs are one approach to higher power efficiency. We present the new L-CSC cluster, a commodity-hardware compute cluster dedicated to Lattice QCD simulations at the GSI research facility. L-CSC features a multi-GPU design with four FirePro S9150 GPUs per node, each providing 320 GB/s memory bandwidth and 2.6 TFLOPS peak performance. The high memory bandwidth makes it ideally suited to memory-bound LQCD computations, while the multi-GPU design ensures superior power efficiency. The November 2014 Green500 list ranked L-CSC as the most power-efficient supercomputer in the world, at 5270 MFLOPS/W in the Linpack benchmark. This paper presents optimizations to our Linpack implementation, HPL-GPU, and other power-efficiency improvements that helped L-CSC achieve this result. It describes our approach to an accurate Green500 power measurement and reveals some problems with the current measurement methodology. Finally, it gives an overview of the Lattice QCD application on L-CSC.
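The claim that LQCD is memory-bound on this hardware can be illustrated with a simple roofline estimate. The sketch below is illustrative only: the peak-performance and bandwidth figures are the S9150 numbers quoted in the abstract, while the example arithmetic intensity of ~1 FLOP/byte for an LQCD stencil kernel is an assumed figure, not one taken from the paper.

```python
# Roofline sketch: attainable throughput is capped by either peak compute
# or (arithmetic intensity x memory bandwidth), whichever is smaller.
PEAK_FLOPS = 2.6e12  # FirePro S9150 peak, FLOP/s (figure from the abstract)
BANDWIDTH = 320e9    # FirePro S9150 memory bandwidth, bytes/s (from the abstract)

def attainable_flops(intensity_flop_per_byte: float) -> float:
    """Attainable FLOP/s for a kernel with the given arithmetic intensity."""
    return min(PEAK_FLOPS, intensity_flop_per_byte * BANDWIDTH)

# Ridge point: the intensity above which a kernel becomes compute-bound.
ridge = PEAK_FLOPS / BANDWIDTH  # 8.125 FLOP/byte

# An LQCD stencil kernel at an assumed ~1 FLOP/byte sits far below the
# ridge point, so it is limited by bandwidth, not by peak compute:
lqcd_bound = attainable_flops(1.0)  # 320e9 FLOP/s, i.e. ~12% of peak
```

In contrast, dense Linpack (DGEMM-dominated) has a very high arithmetic intensity and lands on the compute roof, which is why the same machine can approach peak in the Green500 benchmark while LQCD kernels are bandwidth-limited.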
