This paper presents an approach to adapting double-precision matrix multiplication to the architecture of Cell processors. The algorithm used on a single SPE is based on the operation C = C + A*B performed on matrices of size 64×64; these matrices are further divided into smaller submatrices, each of which corresponds to one micro-kernel operation. Our approach is based on a performance model constructed as a function of the submatrix size. The model accounts for factors such as the size of the local store, the number of registers, the properties of double-precision operations, and the balance between pipelines. This approach allows us to take into account the properties of both the first generation of Cell processors and its successor, the PowerXCell 8i.
The adaptation is followed by an optimization phase, which includes loop transformations, a kernel implementation using SIMD instructions, and other transformations needed to balance the even and odd pipelines. Finally, we present the hand tuning performed with the IBM Assembly Visualizer tool. The proposed adaptation and optimizations allow us to achieve about 96% of the peak performance.
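The decomposition described above, a 64×64 update C = C + A*B split into submatrix-sized micro-kernel calls, can be illustrated with a minimal sketch. It is written in plain Python for readability and is not the SPE implementation from the paper. The function names and the submatrix dimensions `mb`, `nb`, and `kb` are hypothetical; the paper selects the submatrix size through its performance model, which is not reproduced here.

```python
N = 64  # matrix size handled on a single SPE, as stated in the text


def micro_kernel(C, A, B, i0, j0, k0, mb, nb, kb):
    """Update one mb x nb submatrix of C in place.

    Computes C[i0:i0+mb, j0:j0+nb] += A[i0:i0+mb, k0:k0+kb] * B[k0:k0+kb, j0:j0+nb].
    On an SPE this is the part that would be written with SIMD instructions.
    """
    for i in range(i0, i0 + mb):
        for j in range(j0, j0 + nb):
            acc = C[i][j]
            for k in range(k0, k0 + kb):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def blocked_gemm(C, A, B, mb=4, nb=4, kb=64):
    """C = C + A*B for N x N matrices, as a sequence of micro-kernel calls.

    mb, nb, kb are illustrative defaults (assumptions), not the paper's values.
    """
    if N % mb or N % nb or N % kb:
        raise ValueError("submatrix dimensions must divide N")
    for i0 in range(0, N, mb):
        for j0 in range(0, N, nb):
            for k0 in range(0, N, kb):
                micro_kernel(C, A, B, i0, j0, k0, mb, nb, kb)


def reference_gemm(C, A, B):
    """Unblocked triple loop, used only to check the blocked version."""
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * B[k][j]
```

For scale, three 64×64 double-precision matrices occupy 3 × 64 × 64 × 8 = 98,304 bytes (96 KB), which fits within the 256 KB local store of an SPE. Changing `mb`, `nb`, and `kb` leaves the result unchanged and only alters how the work is grouped, which is the degree of freedom a submatrix-size performance model would explore.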