190 TFlops Astrophysical N-body Simulation on a Cluster of GPUs

We present the results of a hierarchical N-body simulation on DEGIMA, a cluster of PCs with 576 graphics processing units (GPUs) and an InfiniBand interconnect. DEGIMA stands for DEstination for GPU Intensive MAchine, and is located at the Nagasaki Advanced Computing Center (NACC), Nagasaki University. In this work, we have upgraded DEGIMA's interconnect to InfiniBand. DEGIMA is composed of 144 nodes with 576 GT200 GPUs. An astrophysical N-body simulation with 3,278,982,596 particles using a treecode algorithm shows a sustained performance of 190.5 Tflops on DEGIMA. The overall cost of the hardware was $411,921. The maximum corrected performance is 104.8 Tflops for the simulation, resulting in a cost performance of 254.4 Mflops/$. This correction is performed by counting the floating-point operations based on the most efficient CPU algorithm. Any extra floating-point operations that arise from the GPU implementation and parameter differences are not included in the 254.4 Mflops/$.
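The cost-performance figure follows directly from the numbers quoted above. The short sketch below reproduces that arithmetic; the function and variable names are illustrative and do not come from the paper, and only the figures stated in the abstract are used.

```python
# Reproduce the cost-performance arithmetic from the abstract.
# All names here are hypothetical; only the numeric inputs come from the text.

HARDWARE_COST_USD = 411_921   # overall hardware cost
SUSTAINED_TFLOPS = 190.5      # raw sustained performance on the GPU cluster
CORRECTED_TFLOPS = 104.8      # performance counted against the most efficient CPU algorithm
NUM_NODES = 144
NUM_GPUS = 576


def mflops_per_dollar(tflops: float, cost_usd: float) -> float:
    """Convert a Tflops figure and a dollar cost into Mflops per dollar."""
    mflops = tflops * 1.0e6   # 1 Tflops = 10^6 Mflops
    return mflops / cost_usd


corrected_cost_performance = mflops_per_dollar(CORRECTED_TFLOPS, HARDWARE_COST_USD)
gpus_per_node = NUM_GPUS // NUM_NODES

print(f"Corrected cost performance: {corrected_cost_performance:.1f} Mflops/$")
print(f"GPUs per node: {gpus_per_node}")
```

Using the corrected 104.8 Tflops rather than the raw 190.5 Tflops is what yields the reported 254.4 Mflops/$; the raw figure would give a higher number that includes operations specific to the GPU implementation.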
