42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence

As an entry for the 2009 Gordon Bell price/performance prize, we present the results of two different hierarchical N-body simulations on a cluster of 256 graphics processing units (GPUs). Unlike many previous N-body simulations on GPUs, which scale as O(N^2), the present method runs the O(N log N) treecode and the O(N) fast multipole method (FMM) on the GPUs with unprecedented efficiency. We demonstrate the performance of our method with one standard application (a gravitational N-body simulation) and one non-standard application (a simulation of turbulence using vortex particles). The gravitational simulation, using the treecode with 1,608,044,129 particles, showed a sustained performance of 42.15 TFlops. The vortex particle simulation of homogeneous isotropic turbulence, using the periodic FMM with 16,777,216 particles, showed a sustained performance of 20.2 TFlops. The overall cost of the hardware was 228,912 dollars. The maximum corrected performance is 28.1 TFlops for the gravitational simulation, which results in a cost performance of 124 MFlops/$. This correction is made by counting the Flops based on the most efficient CPU algorithm. Any extra Flops that arise from the GPU implementation and parameter differences are not included in the 124 MFlops/$.
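
To give a sense of why the hierarchical methods matter at these particle counts, the following sketch compares the asymptotic growth terms N^2, N log2 N, and N for the two problem sizes quoted above. It is only an illustration: the function names are hypothetical, all constant factors (which differ considerably between direct summation, treecode, and FMM) are ignored, and the numbers are not operation counts from the paper.

```python
import math

# Illustrative growth terms only; constant factors are deliberately ignored,
# so these are NOT the Flop counts reported for the actual simulations.

def direct_sum_term(n: int) -> float:
    """Growth term for all-pairs direct summation, O(N^2)."""
    return float(n) * float(n)

def treecode_term(n: int) -> float:
    """Growth term for a treecode, O(N log N) (base-2 log chosen arbitrarily)."""
    return n * math.log2(n)

def fmm_term(n: int) -> float:
    """Growth term for the fast multipole method, O(N)."""
    return float(n)

# Particle counts taken from the abstract.
problem_sizes = {
    "gravitational (treecode)": 1_608_044_129,
    "vortex (periodic FMM)": 16_777_216,
}

for name, n in problem_sizes.items():
    ratio_tree = direct_sum_term(n) / treecode_term(n)
    ratio_fmm = direct_sum_term(n) / fmm_term(n)
    print(f"{name}: N = {n:,}")
    print(f"  N^2 / (N log2 N) ~ {ratio_tree:.3g}")
    print(f"  N^2 / N          ~ {ratio_fmm:.3g}")
```

Even with generous constant factors in favor of direct summation, the gap of several orders of magnitude suggests why an O(N^2) method would be impractical at these sizes.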
