Optimizing matrix multiplication for a short-vector SIMD architecture - CELL processor

Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least squares problems, and singular value and eigenvalue computations. Aside from special-purpose accelerators such as Graphics Processing Units (GPUs), the STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single-precision floating-point performance. In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, a fast implementation of the matrix multiplication operation is essential. The crucial component is the matrix multiplication kernel crafted for the short-vector Single Instruction Multiple Data (SIMD) architecture of the Synergistic Processing Element (SPE) of the CELL processor. In this paper, single-precision matrix multiplication kernels are presented implementing the C = C - A × B^T operation and the C = C - A × B operation for matrices of size 64 × 64 elements. For the latter case, a performance of 25.55 Gflop/s is reported, or 99.80% of the peak, using as little as 5.9 kB of storage for code and auxiliary data structures.
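
For reference, the sketch below shows a plain scalar C implementation of the C = C - A × B^T update on 64 × 64 single-precision matrices, i.e., the operation the optimized kernels compute. The function name sgemm_nt_update and row-major storage are assumptions made here for illustration; the kernels described in the paper are hand-crafted with SPE short-vector SIMD intrinsics, unrolling, and tuned data layouts, none of which are reproduced in this sketch.

```c
#include <stddef.h>

#define N 64  /* kernel operates on 64x64 tiles */

/* Scalar reference for the C = C - A * B^T update on single-precision
 * 64x64 matrices. Row-major storage is assumed. This is only a minimal
 * sketch for clarity, not the SIMD kernel described in the paper. */
static void sgemm_nt_update(float C[N][N], const float A[N][N], const float B[N][N])
{
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            /* Dot product of row i of A with row j of B,
             * which is column j of B^T. */
            for (size_t k = 0; k < N; k++)
                acc += A[i][k] * B[j][k];
            C[i][j] -= acc;
        }
    }
}
```

The C = C - A × B variant differs only in the inner-loop indexing (B[k][j] instead of B[j][k]); the point of the paper is that the transposed variant maps more naturally onto the SPE's short-vector instructions and local-store data layout.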
