An Embedded DSL for High Performance Declarative Communication with Correctness Guarantees in C++

High-performance programming with explicit communication calls requires considerable expertise to optimize. Tuning for performance often involves asynchronous calls, which risk introducing bugs and make programs harder to debug. Techniques that prove desirable program properties, such as deadlock freedom, invariably incur significant performance overheads. We have developed Kanor, a domain-specific language embedded in C++, that enables programmers to specify communication declaratively in the Bulk Synchronous Parallel (BSP) style. Well-formed Kanor programs are guaranteed to be deadlock-free. We first present an operational semantics for a subset of Kanor and prove deadlock-freedom and determinism properties based on that semantics. We then show how Kanor's declarative nature allows communication patterns to be detected and optimized.
