A Survey: Runtime Software Systems for High Performance Computing

HPC system design and operation are challenged by the critical requirements for significant advances in efficiency, scalability, user productivity, and performance portability, even as Moore's Law ends with the approach of nano-scale semiconductor technology. Conventional practice employs distributed-memory message-passing programming interfaces, sometimes combined with second-level thread-based interfaces for intra-node shared memory such as OpenMP, or with means of controlling heterogeneous components such as OpenCL for GPUs. While these methods include some modest runtime control, they are principally coarse-grained and statically scheduled. Yet many real-world applications yield efficiencies of less than 10%, although some benchmarks may achieve 80% efficiency or better (e.g., HPL). To address these challenges, strategies employing runtime software systems are being pursued to exploit information about the status of the application and the system hardware throughout execution; this introspection guides task scheduling and resource management in support of dynamic adaptive control. Runtime systems provide adaptive means to reduce the effects of starvation, latency, overhead, and contention. While each is unique in its details, many share common properties: multi-tasking, either preemptive or non-preemptive; message-driven computation, such as active messages; sophisticated fine-grain synchronization, such as dataflow and futures constructs; global name or address spaces; and control policies for optimizing task scheduling, in part to address the uncertainty of asynchrony. This survey identifies key parameters and properties of modern, and sometimes experimental, runtime systems in active use today, and provides a detailed description, summary, and comparison within a shared space of dimensions. It is not the intent of this paper to determine which is better or worse, but rather to provide sufficient detail to permit readers to select among them according to their individual needs.
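To make the futures and dataflow constructs named above concrete, the following is a minimal sketch of futures-based fine-grain synchronization, written in standard C++ (std::async/std::future) rather than in the API of any surveyed runtime system; the partial_sum task and all parameters are illustrative assumptions, not drawn from any particular system.

```cpp
// Illustrative sketch only: futures-style fine-grain synchronization in
// standard C++. Distributed HPC runtime systems provide analogous (but
// richer) constructs spanning multiple nodes and lightweight threads.
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// A "task" producing a partial result; under a runtime system such a task
// could be scheduled on any worker thread or remote locality.
double partial_sum(const std::vector<double>& v, std::size_t lo, std::size_t hi) {
    return std::accumulate(v.begin() + lo, v.begin() + hi, 0.0);
}

int main() {
    std::vector<double> data(1'000'000, 1.0);

    // Launch two asynchronous tasks; each returns a future, i.e., a handle
    // to a value that may not yet exist. Control flow continues immediately.
    auto f1 = std::async(std::launch::async, partial_sum,
                         std::cref(data), std::size_t{0}, data.size() / 2);
    auto f2 = std::async(std::launch::async, partial_sum,
                         std::cref(data), data.size() / 2, data.size());

    // Synchronization is fine-grained: the consumer blocks only on the
    // specific values it depends on (dataflow style), not a global barrier.
    double total = f1.get() + f2.get();
    std::cout << "sum = " << total << "\n";
    return 0;
}
```

The point of the sketch is the synchronization discipline: each consumer waits only for the values it actually depends on, which is the behavior that dataflow and futures constructs generalize to distributed memory and very large task counts in the runtime systems this survey covers.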
