Implementing the Wilson-Dirac Operator on the Cell Broadband Engine

Computing the action of the Wilson-Dirac operator accounts for most of the CPU time in the grand challenge problem of simulating Lattice Quantum Chromodynamics (Lattice QCD). This routine is difficult to implement efficiently in most computational environments because the same data are accessed in multiple patterns, which makes it hard to align the data efficiently at compile time. In addition, the low ratio of computation to memory access makes the routine bound by memory bandwidth and memory latency. In this work, we present an implementation of this routine on the Cell Broadband Engine. We propose runtime data fusion, an approach that re-aligns at runtime the data that cannot be aligned optimally at compile time, thus improving the performance of SIMDized execution. We also present a DMA optimization technique that reduces the impact of bandwidth limits on performance. Our implementation of this routine achieves 31.2 GFlops for single-precision computations and 8.75 GFlops for double-precision computations.
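To make the low computation-to-memory-access ratio concrete, the following is a rough back-of-the-envelope sketch, not a figure taken from this work. It assumes the commonly quoted count of 1320 floating-point operations per lattice site for the Wilson-Dirac hopping term, and it assumes no data reuse: each site update reads 8 SU(3) gauge links (18 real numbers each) and 8 neighbor spinors (24 real numbers each) and writes 1 output spinor (24 real numbers). The function name and parameters are hypothetical.

```python
def wilson_dirac_intensity(bytes_per_real, flops_per_site=1320):
    """Estimate flops per byte for one lattice-site update.

    Assumptions (not from the source): 1320 flops per site, and every
    operand is fetched from memory with no reuse between sites.
    """
    link_reals = 8 * 18      # 8 gauge links, each a 3x3 complex matrix
    neighbor_reals = 8 * 24  # 8 neighbor spinors, each 4 spins x 3 colors, complex
    output_reals = 24        # 1 result spinor written back
    bytes_per_site = (link_reals + neighbor_reals + output_reals) * bytes_per_real
    return flops_per_site / bytes_per_site


single = wilson_dirac_intensity(bytes_per_real=4)
double = wilson_dirac_intensity(bytes_per_real=8)
print(f"single precision: {single:.2f} flops/byte")
print(f"double precision: {double:.2f} flops/byte")
```

Under these assumptions the routine performs less than one floating-point operation per byte moved in single precision and half of that in double precision, which illustrates why memory bandwidth and latency, rather than arithmetic throughput, tend to limit performance.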
