Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages

With the shift to exascale computer systems, the importance of productive programming models for distributed systems is increasing. Partitioned Global Address Space (PGAS) programming models aim to reduce the complexity of writing distributed-memory parallel programs by introducing global operations on distributed arrays, distributed task parallelism, directed synchronization, and mutual exclusion. However, a key challenge in the application of PGAS programming models is the improvement of compilers and runtime systems. In particular, one open question is how runtime systems meet the requirement of exascale systems, where a large number of asynchronous tasks are executed. While there are various tasking runtimes such as Qthreads, OCR, and HClib, there is no existing comparative study on PGAS tasking/threading runtime systems. To explore runtime systems for PGAS programming languages, we have implemented OCR-based and HClib-based Chapel runtimes and evaluated them with an initial focus on tasking and synchronization implementations. The results show that our OCR and HClib-based implementations can improve the performance of PGAS programs compared to the existing Qthreads backend of Chapel.

[1]  David F. Richards,et al.  Optimizing PGAS Overhead in a Multi-locale Chapel Implementation of CoMD , 2016, 2016 PGAS Applications Workshop (PAW).

[2]  Vivek Sarkar,et al.  Phasers: a unified deadlock-free construct for collective and point-to-point synchronization , 2008, ICS '08.

[3]  Alexander Aiken,et al.  Realm: An event-based low-level runtime for distributed memory architectures , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[4]  Douglas Thain,et al.  Qthreads: An API for programming with millions of lightweight threads , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[5]  Bradley C. Kuszmaul,et al.  Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.

[6]  Michael L. Scott,et al.  Algorithms for scalable synchronization on shared-memory multiprocessors , 1991, TOCS.

[7]  Vivek Sarkar,et al.  Integrating Asynchronous Task Parallelism with OpenSHMEM , 2016, OpenSHMEM.

[8]  Benoît Meister,et al.  The Open Community Runtime: A runtime system for extreme scale computing , 2016, 2016 IEEE High Performance Extreme Computing Conference (HPEC).

[9]  Christina Freytag,et al.  Using Mpi Portable Parallel Programming With The Message Passing Interface , 2016 .

[10]  Katherine A. Yelick,et al.  UPC++: A PGAS Extension for C++ , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[11]  Barbara M. Chapman,et al.  Introducing OpenSHMEM: SHMEM for the PGAS community , 2010, PGAS '10.

[12]  Vivek Sarkar,et al.  HabaneroUPC++: a Compiler-free PGAS Library , 2014, PGAS.

[13]  Vivek Sarkar,et al.  Integrating Asynchronous Task Parallelism with MPI , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[14]  Alexander Aiken,et al.  Legion: Expressing locality and independence with logical regions , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[15]  Vivek Sarkar,et al.  X10: an object-oriented approach to non-uniform cluster computing , 2005, OOPSLA '05.

[16]  Stephen L. Olivier,et al.  UTS: An Unbalanced Tree Search Benchmark , 2006, LCPC.

[17]  Robert W. Numrich,et al.  Co-array Fortran for parallel programming , 1998, FORF.

[18]  Katherine Yelick,et al.  UPC Language Specifications V1.1.1 , 2003 .

[19]  Vivek Sarkar,et al.  Optimized Distributed Work-Stealing , 2016, 2016 6th Workshop on Irregular Applications: Architecture and Algorithms (IA3).