GRAPE-MP: An SIMD Accelerator Board for Multi-precision Arithmetic

Abstract

We describe the design and performance of the GRAPE-MP board, an SIMD accelerator board for quadruple-precision arithmetic operations. A GRAPE-MP board houses one GRAPE-MP processor chip and an FPGA chip that handles communication with the host computer. A GRAPE-MP chip has 6 processing elements (PEs) and operates at a clock frequency of 100 MHz. Each PE can perform one addition and one multiplication in every clock cycle. The architecture of the GRAPE-MP is similar to that of the GRAPE-DR. It is implemented using a structured ASIC chip from eASIC Corporation. A GRAPE-MP processor board has a theoretical peak quadruple-precision performance of 1.2 Gflops. As a preliminary result, we present the performance of the GRAPE-MP board for two target applications. The numerical integration of Feynman loop integrals runs at 0.53 Gflops. An N-body simulation with the second-order leapfrog scheme runs at 0.505 Gflops for N = 1984, more than 10 times the performance of the host computer.
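The quoted peak figure follows directly from the chip parameters stated above: 6 PEs, two quadruple-precision operations (one addition and one multiplication) per PE per cycle, and a 100 MHz clock. Dividing the two measured rates by this peak gives the fraction of peak each application reaches. These percentages are simple arithmetic on the reported numbers, not separately measured values.

$$
P_{\mathrm{peak}} = 6\ \mathrm{PEs} \times 2\ \frac{\mathrm{flop}}{\mathrm{PE}\cdot\mathrm{cycle}} \times 100 \times 10^{6}\ \frac{\mathrm{cycles}}{\mathrm{s}} = 1.2 \times 10^{9}\ \mathrm{flop/s} = 1.2\ \mathrm{Gflops}
$$

$$
\frac{0.53}{1.2} \approx 44\%\ \text{(Feynman loop integration)}, \qquad \frac{0.505}{1.2} \approx 42\%\ \text{(N-body, } N = 1984\text{)}
$$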
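The abstract names the second-order leapfrog scheme without spelling it out. The following is a minimal sketch, assuming a direct-summation gravitational N-body problem with G = 1, Plummer-type softening `eps2`, and the kick-drift-kick form of leapfrog. Python's `decimal` module with 34 significant digits stands in for quadruple precision; this is decimal software arithmetic, not the GRAPE-MP number format. The function names `accelerations`, `leapfrog_step`, and `total_energy` are hypothetical and are not the GRAPE-MP software interface.

```python
from decimal import Decimal, getcontext

# Assumption: 34 significant decimal digits stand in for quadruple precision
# (IEEE binary128 carries roughly 34 decimal digits). This is decimal software
# arithmetic, not the GRAPE-MP hardware format.
getcontext().prec = 34


def accelerations(pos, mass, eps2):
    """Direct-summation O(N^2) gravitational accelerations, G = 1, softening eps2."""
    n = len(pos)
    acc = [[Decimal(0)] * 3 for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0] * dx[0] + dx[1] * dx[1] + dx[2] * dx[2] + eps2
            inv_r3 = 1 / (r2 * r2.sqrt())
            for k in range(3):
                acc[i][k] += mass[j] * dx[k] * inv_r3
    return acc


def leapfrog_step(pos, vel, acc, mass, dt, eps2):
    """One kick-drift-kick step; reuses acc so each step costs one force evaluation."""
    half = dt / 2
    vel = [[v[k] + half * a[k] for k in range(3)] for v, a in zip(vel, acc)]  # half kick
    pos = [[x[k] + dt * v[k] for k in range(3)] for x, v in zip(pos, vel)]  # full drift
    acc = accelerations(pos, mass, eps2)  # forces at the new positions
    vel = [[v[k] + half * a[k] for k in range(3)] for v, a in zip(vel, acc)]  # half kick
    return pos, vel, acc


def total_energy(pos, vel, mass, eps2):
    """Kinetic plus softened potential energy (G = 1), used to monitor accuracy."""
    kin = sum(m * (v[0] ** 2 + v[1] ** 2 + v[2] ** 2) for m, v in zip(mass, vel)) / 2
    pot = Decimal(0)
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0] * dx[0] + dx[1] * dx[1] + dx[2] * dx[2] + eps2
            pot -= mass[i] * mass[j] / r2.sqrt()
    return kin + pot
```

For example, two equal masses of 0.5 at (-0.5, 0, 0) and (0.5, 0, 0) with velocities (0, -0.5, 0) and (0, 0.5, 0) form a circular orbit with total energy -0.125 in these units; after 100 steps with dt = 0.01 the relative energy error should remain small. The O(N^2) force loop in `accelerations` is the part a GRAPE-style board would typically take over; here it runs entirely on the host for illustration.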
