Lattice QCD on Intel Xeon Phi

The Intel Xeon Phi architecture features parallelism at the level of many x86-based cores, multiple threads per core, and vector processing units. Lattice Quantum Chromodynamics (LQCD) is currently the only known model-independent, non-perturbative computational method for calculations in the theory of the strong interactions, and it is of importance in studies of nuclear and high-energy physics. In this contribution, we describe our experiences optimizing a key LQCD kernel for the Xeon Phi architecture. On a single node, our Dslash kernel sustains a performance of around 280 GFLOPS, while our full solver sustains around 215 GFLOPS. Furthermore, we demonstrate a fully 'native' multi-node LQCD implementation running entirely on Knights Corner (KNC) nodes with minimal involvement of the host CPU. Our multi-node implementation of the solver has been strong-scaled to 3.6 TFLOPS on 64 KNCs.
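To illustrate the memory-access pattern that dominates a Dslash-style kernel, the sketch below (our own simplification for exposition, not the paper's optimized implementation) applies a nearest-neighbour hopping term on a periodic 4D lattice. Real Wilson Dslash also performs spin projection with gamma matrices on 4-spinor fields; here the spin structure is omitted, and only the SU(3)-like link multiplication and the forward/backward neighbour gathers are kept.

```python
import numpy as np

def dslash_like(psi, U):
    """Schematic Dslash-like hopping term (spin structure omitted).

    psi : complex array, shape dims + (3,)         -- colour vector per site
    U   : complex array, shape (4,) + dims + (3, 3) -- one link per direction

    out(x) = sum_mu [ U_mu(x) psi(x + mu_hat)
                    + U_mu(x - mu_hat)^dagger psi(x - mu_hat) ]
    """
    out = np.zeros_like(psi)
    for mu in range(4):
        # forward hop: gather psi from the +mu neighbour, multiply by U_mu(x)
        fwd = np.roll(psi, -1, axis=mu)
        out += np.einsum('...ab,...b->...a', U[mu], fwd)
        # backward hop: gather from the -mu neighbour, multiply by the
        # daggered link stored on that neighbouring site
        Udag_back = np.roll(np.conj(np.swapaxes(U[mu], -1, -2)), 1, axis=mu)
        bwd = np.roll(psi, 1, axis=mu)
        out += np.einsum('...ab,...b->...a', Udag_back, bwd)
    return out
```

Even this toy version makes the optimization challenge visible: each site touches eight neighbours and eight 3x3 link matrices, so the arithmetic intensity is low and performance is governed by how well the neighbour gathers map onto the cache hierarchy and vector units.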
