The performance of GRAPE-DR for dense matrix operations

Abstract We describe the implementation and performance of dense matrix multiplication and LU decomposition on the GRAPE-DR SIMD accelerator board. A GRAPE-DR card, with 4 GRAPE-DR chips, has a theoretical peak double-precision (DP) performance of 819 Gflops. Each GRAPE-DR chip has 512 processing elements (PEs) and operates at a clock frequency of 400 MHz. Each PE can perform one addition and one multiplication every two clock cycles. The measured performance of matrix multiplication is 730 Gflops for the multiplication of matrices of size 51200 by 2048 and 2048 by 51200. The performance of LU decomposition is 480 Gflops for a problem size of 51200.
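
The peak figure follows directly from the hardware parameters quoted above, and the measured rates can be expressed as fractions of that peak. The following is a minimal sketch of that arithmetic, not code from the paper. The constant and function names are illustrative, and the operation counts used for matrix multiplication (2mkn) and LU decomposition (2n^3/3) are the conventional leading-order counts, assumed here rather than stated in the abstract.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# All names are illustrative; only the numeric inputs come from the text.

CHIPS_PER_CARD = 4
PES_PER_CHIP = 512
CLOCK_HZ = 400e6
# One addition plus one multiplication every two clock cycles.
FLOPS_PER_PE_PER_CYCLE = 2 / 2


def peak_gflops():
    """Theoretical peak DP rate of one card, in Gflops."""
    flops_per_second = (
        CHIPS_PER_CARD * PES_PER_CHIP * CLOCK_HZ * FLOPS_PER_PE_PER_CYCLE
    )
    return flops_per_second / 1e9


def efficiency(measured_gflops):
    """Measured rate as a fraction of the theoretical peak."""
    return measured_gflops / peak_gflops()


def gemm_flops(m, k, n):
    """Assumed operation count for an (m x k) times (k x n) product."""
    return 2 * m * k * n


def lu_flops(n):
    """Assumed leading-order operation count for LU of an n x n matrix."""
    return 2 * n**3 / 3


if __name__ == "__main__":
    print(f"peak: {peak_gflops():.1f} Gflops")
    print(f"matrix multiplication efficiency: {efficiency(730):.1%}")
    print(f"LU decomposition efficiency: {efficiency(480):.1%}")
    # Implied run times at the measured rates, under the assumed counts.
    print(f"matmul time: {gemm_flops(51200, 2048, 51200) / 730e9:.1f} s")
    print(f"LU time: {lu_flops(51200) / 480e9:.1f} s")
```

Under these assumptions the product 4 x 512 x 400 MHz x 1 flop per cycle gives 819.2 Gflops, so the measured rates correspond to roughly 89% of peak for matrix multiplication and roughly 59% for LU decomposition.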