Design and exploitation of a high-performance SIMD floating-point unit for Blue Gene/L

We describe the design of a dual-issue single-instruction, multiple-data-like (SIMD-like) extension of the IBM PowerPC® 440 floating-point unit (FPU) core, together with the compiler and algorithmic techniques needed to exploit it. This extended FPU targets both the IBM massively parallel Blue Gene®/L machine and more pervasive embedded platforms. We discuss the hardware and software codesign that was essential to fully realize the performance benefits of the FPU under the memory bandwidth limitations and high penalties for misaligned data access imposed by the memory hierarchy of a Blue Gene/L node. Using both hand-optimized and compiled code for key linear algebra kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Our measurements show that the combination of algorithm, compiler, and hardware delivers a significant fraction of peak floating-point performance for compute-bound kernels, such as matrix multiplication, and a significant fraction of peak memory bandwidth for memory-bound kernels, such as DAXPY, while remaining largely insensitive to data alignment.

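As a concrete illustration of the memory-bound kernel mentioned above (this sketch is ours, not code from the paper), DAXPY computes y ← αx + y. The minimal portable C version below is unrolled by two to mirror the pairing of operations across the two pipes of the dual FPU; the function name and unrolling scheme are illustrative, and the actual Blue Gene/L code would use the compiler's paired quadword load/store and fused multiply-add operations, which require 16-byte-aligned data for best performance.

```c
#include <stddef.h>

/* Conceptual DAXPY sketch: y[i] += alpha * x[i].
 * Unrolled by two to mirror the dual (primary/secondary) FPU pipes;
 * on a SIMD-like dual FPU, each pair of iterations would ideally map
 * to one paired load, one paired fused multiply-add, and one paired
 * store, provided x and y are suitably aligned. */
static void daxpy(size_t n, double alpha, const double *x, double *y)
{
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {   /* two elements per step */
        y[i]     += alpha * x[i];      /* primary pipe */
        y[i + 1] += alpha * x[i + 1];  /* secondary pipe */
    }
    if (i < n)                         /* odd-length remainder */
        y[i] += alpha * x[i];
}
```

Because each element loaded is used in only one multiply-add, the kernel performs O(1) floating-point operations per memory access, which is why its achievable performance is bounded by memory bandwidth rather than peak FLOPS.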