ASC ATDM Level 2 Milestone #5325: Asynchronous Many-Task Runtime System Analysis and Assessment for Next Generation Platforms

This report provides in-depth information and analysis to help create a technical road map for developing next-generation programming models and runtime systems that support Advanced Simulation and Computing (ASC) workload requirements. The focus herein is on asynchronous many-task (AMT) models and runtime systems, which are of great interest in the context of “exascale” computing, as they hold the promise to address key issues associated with future extreme-scale computer architectures. This report includes a thorough qualitative and quantitative examination of three best-of-class AMT runtime systems: Charm++, Legion, and Uintah, all of which are in use as part of the ASC Predictive Science Academic Alliance Program II (PSAAP-II) Centers. The studies focus on each runtime’s programmability, performance, and mutability. Through the experiments and analysis presented, several overarching findings emerge. From a performance perspective, AMT runtimes show tremendous potential for addressing extreme-scale challenges. Empirical studies show that an AMT runtime can mitigate performance heterogeneity inherent to the machine itself and that Message Passing Interface (MPI) and AMT runtimes perform comparably under balanced conditions. From a programmability and mutability perspective, however, none of the runtimes in this study is currently ready for use in developing production-ready Sandia ASC applications. The report concludes by recommending a co-design path forward, wherein application, programming model, and runtime system developers work together to define requirements and solutions. Such a requirements-driven co-design approach benefits the high-performance computing (HPC) community as a whole, with widespread community engagement mitigating risk for both application developers and runtime system developers.
