PGAS with Lightweight Threads and the Barnes-Hut Algorithm

We describe a novel runtime system that integrates lightweight threads with a partitioned global address space (PGAS) mode of computation and apply it to the Barnes-Hut (BH) algorithm. Our model combines low-latency, zero-copy, one-sided communication via PGAS with fast context switching and user-managed preemptive lightweight threads in a single hybrid interface. We describe the challenges in designing such a runtime system, analyze approaches and trade-offs, and present benchmark results. Our BH application exemplifies the use of the model and shows how it yields a simple yet efficient and scalable algorithm. Our implementation outperforms a state-of-the-art implementation by up to 13 times. The hybrid model also improves the performance of various multi-threaded micro-benchmarks on a distributed-memory cluster.

Keywords: Barnes-Hut, PGAS, lightweight threads, Qthreads
