Efficient SIMDization and data management of the Lattice QCD computation on the Cell Broadband Engine

Lattice Quantum Chromodynamics (QCD) models subatomic interactions based on a four-dimensional discretized space-time continuum. The Lattice QCD computation is one of the grand challenges in physics, especially when modeling a lattice with small spacing. In this work, we study the implementation on the Cell Broadband Engine of the main kernel routine of Lattice QCD, which dominates the execution time. We tackle two problems: efficient SIMD execution and the limited bandwidth for data transfers with off-chip memory. For efficient SIMD execution, we present a runtime data fusion technique that groups data processed similarly at runtime. We also introduce the analysis needed to reduce the pressure on the scarce memory bandwidth that limits the performance of this computation. We study two implementations of the main kernel routine that exhibit different memory access patterns and thus allow different sets of optimizations. We show the attributes that make one implementation more favorable in terms of performance. For a lattice size that is significantly larger than the local store, our implementation achieves 31.2 GFlops for single-precision computations and 16.6 GFlops for double-precision computations on the PowerXCell 8i, an order of magnitude better than the performance achieved on most general-purpose processors.
