Managing Asynchronous Operations in Coarray Fortran 2.0

As the gap between processor speed and network latency continues to widen, avoiding exposed communication latency is critical for high performance on modern supercomputers. One can hide communication latency by overlapping it with computation using non-blocking data transfers, or avoid exposing it by moving computation to the location of the data it manipulates. Coarray Fortran 2.0 (CAF 2.0), a partitioned global address space language, provides a rich set of asynchronous operations for avoiding exposed latency, including asynchronous copies, function shipping, and asynchronous collectives. CAF 2.0 provides event variables to manage the completion of asynchronous operations that use explicit completion. This paper describes CAF 2.0's finish and cofence synchronization constructs, which manage the implicit completion of asynchronous operations. finish ensures global completion of a set of asynchronous operations across the members of a team. Because of CAF 2.0's SPMD model, its semantics and implementation of finish differ significantly from those of finish in X10 and Habanero-C. cofence controls local data completion of implicitly-synchronized asynchronous operations. Together, these constructs make it possible to tune a program's performance by exploiting the differences among local data completion, local operation completion, and global completion of asynchronous operations, while hiding network latency. We explore subtle interactions among cofence, finish, events, asynchronous copies and collectives, and function shipping, and we justify their presence in a relaxed memory model for CAF 2.0. We demonstrate the utility of these constructs in the context of two benchmarks: Unbalanced Tree Search (UTS) and HPC Challenge RandomAccess. Using our synchronization constructs, we achieve 74-77% parallel efficiency on 4K-32K cores for UTS with the T1WL workload specification, demonstrating scalable performance. A cofence micro-benchmark shows that, in a producer-consumer scenario, using local data completion rather than local operation completion yields superior performance.
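The distinction between the two completion guarantees can be illustrated with a schematic CAF 2.0-style fragment. The construct names finish and cofence come from the abstract above; the operation names (copy_async, spawn), the team name team_world, and the overall syntax are illustrative assumptions rather than verified CAF 2.0 code, so this should be read as pseudocode:

```fortran
! Illustrative sketch only: copy_async, spawn, and team_world are
! assumed names, not verified CAF 2.0 syntax.

! Global completion: finish ensures that all asynchronous operations
! initiated in its body have completed on every member of the team.
finish (team_world)
  copy_async(dest[p], src)   ! asynchronous copy; initiation returns immediately
  spawn work(args) [p]       ! function shipping: run work on image p
end finish                   ! all asyncs above are globally complete here

! Local data completion: cofence waits only until local buffers of
! implicitly-synchronized asyncs may be reused, not until the
! operations have completed remotely.
copy_async(dest[p], src)
cofence
src = next_value             ! safe to overwrite the local source buffer
```

In a producer-consumer pattern, waiting only for local data completion (cofence) lets the producer refill its buffer sooner than waiting for local or global operation completion, which is the effect the micro-benchmark above measures.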
