ECP Software Technology Capability Assessment Report

Author(s): Heroux, Michael A.; Carter, Jonathan; Thakur, Rajeev; Vetter, Jeffrey; McInnes, Lois Curfman; Ahrens, James; Neely, J. Robert

Abstract: The Exascale Computing Project (ECP) Software Technology (ST) Focus Area is responsible for developing critical software capabilities that will enable successful execution of ECP applications, and for providing key components of a productive and sustainable exascale computing ecosystem that will position the US Department of Energy (DOE) and the broader high-performance computing (HPC) community with a firm foundation for future extreme-scale computing capabilities. This ECP ST Capability Assessment Report (CAR) provides an overview and assessment of current ECP ST capabilities and activities, giving stakeholders and the broader HPC community information that can be used to assess ECP ST progress and plan their own efforts accordingly. ECP ST leaders commit to updating this document on a regular basis (targeting approximately every six months). Highlights from the report are presented here.

[1]  Dmitri Kuzmin,et al.  Sequential limiting in continuous and discontinuous Galerkin methods for the Euler equations , 2018, J. Comput. Phys..

[2]  V. E. Henson,et al.  BoomerAMG: a parallel algebraic multigrid solver and preconditioner , 2002 .

[3]  Kenneth Moreland,et al.  Visualization for Exascale: Portable Performance is Critical , 2015, Supercomput. Front. Innov..

[4]  Kwan-Liu Ma,et al.  Flexible Analysis Software for Emerging Architectures , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.

[5]  Martin Schulz,et al.  Thread-local concurrency: a technique to handle data race detection at programming model abstraction , 2018, HPDC.

[6]  Rajeev Thakur,et al.  Fine-Grained Multithreading Support for Hybrid Threaded MPI Programming , 2010, Int. J. High Perform. Comput. Appl..

[7]  William Gropp,et al.  Fault Tolerance in Message Passing Interface Programs , 2004, Int. J. High Perform. Comput. Appl..

[8]  Stephen L. Olivier,et al.  OpenMPIR: Implementing OpenMP Tasks with Tapir , 2017, LLVM-HPC@SC.

[9]  Corporate The MPI Forum MPI: a message passing interface , 1993, Supercomputing '93.

[10]  Reid Priedhorsky,et al.  Charliecloud: Unprivileged Containers for User-Defined Software Stacks in HPC , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[11]  Hubert Ritzdorf,et al.  The scalable process topology interface of MPI 2.2 , 2011, Concurr. Comput. Pract. Exp..

[12]  Peng Li,et al.  Combining events and threads for scalable network services implementation and evaluation of monadic, application-level concurrency primitives , 2007, PLDI '07.

[13]  Patrick S. McCormick,et al.  Accommodating Thread-Level Heterogeneity in Coupled Parallel Applications , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[14]  Jonathan Green,et al.  Multi-core and Network Aware MPI Topology Functions , 2011, EuroMPI.

[15]  Guy E. Blelloch,et al.  Vector Models for Data-Parallel Computing , 1990 .

[16]  Jeremy S. Meredith,et al.  Parallel in situ coupling of simulation with a fully featured visualization system , 2011, EGPGV '11.

[17]  Steven G. Johnson,et al.  The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.

[18]  Torsten Hoefler,et al.  Mpi on Millions of Cores * , 2022 .

[19]  Martin Schulz,et al.  Production Hardware Overprovisioning: Real-World Performance Optimization Using an Extensible Power-Aware Resource Management Framework , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[20]  F. Cappello,et al.  Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[21]  Jeffrey Cornelis,et al.  The Communication-Hiding Conjugate Gradient Method with Deep Pipelines , 2018, ArXiv.

[22]  James Demmel,et al.  Minimizing Communication in Numerical Linear Algebra , 2009, SIAM J. Matrix Anal. Appl..

[23]  Akinori Yonezawa,et al.  StackThreads/MP: integrating futures into calling standards , 1999, PPoPP '99.

[24]  Kevin T. Pedretti,et al.  A Tale of Two Systems: Using Containers to Deploy HPC Applications on Supercomputers and Clouds , 2017, 2017 IEEE International Conference on Cloud Computing Technology and Science (CloudCom).

[25]  Michael Bauer,et al.  S3D-Legion : An Exascale Software for Direct Numerical Simulation of Turbulent Combustion with Complex Multicomponent Chemistry , 2017 .

[26]  Gregory Becker,et al.  Managing Combinatorial Software Installations with Spack , 2016, 2016 Third International Workshop on HPC User Support Tools (HUST).

[27]  Vivek Sarkar,et al.  Modeling the conflicting demands of parallelism and Temporal/Spatial locality in affine scheduling , 2018, CC.

[28]  Tamara G. Kolda,et al.  Parallel Tensor Compression for Large-Scale Scientific Data , 2015, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[29]  Abhinav Vishnu,et al.  On the suitability of MPI as a PGAS runtime , 2014, 2014 21st International Conference on High Performance Computing (HiPC).

[30]  Ada Gavrilovska,et al.  CoMerge: toward efficient data placement in shared heterogeneous memory systems , 2017, MEMSYS.

[31]  Tjerk P. Straatsma,et al.  NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations , 2010, Comput. Phys. Commun..

[32]  Tzanio V. Kolev,et al.  Multi‐material closure model for high‐order finite element Lagrangian hydrodynamics , 2016 .

[33]  Bronis R. de Supinski,et al.  The Spack package manager: bringing order to HPC software chaos , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[34]  Martin Schulz,et al.  A Unified Platform for Exploring Power Management Strategies , 2016, 2016 4th International Workshop on Energy Efficient Supercomputing (E2SC).

[35]  William Gropp,et al.  PETSc Users Manual Revision 3.4 , 2016 .

[36]  Pavan Balaji,et al.  Process-Based Asynchronous Progress Model for MPI Point-to-Point Communication , 2017, 2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS).

[37]  Veselin Dobrev,et al.  Curvilinear finite elements for Lagrangian hydrodynamics , 2011 .

[38]  Hank Childs,et al.  Ray tracing within a data parallel framework , 2015, 2015 IEEE Pacific Visualization Symposium (PacificVis).

[39]  Anders Clausen,et al.  Supercomputing Centers and Electricity Service Providers: A Geographically Distributed Perspective on Demand Management in Europe and the United States , 2016, ISC.

[40]  Guang R. Gao,et al.  TiNy threads: a thread virtual machine for the Cyclops64 cellular architecture , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[41]  Douglas Thain,et al.  Qthreads: An API for programming with millions of lightweight threads , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[42]  Lois C. McInnes,et al.  xSDK Foundations: Toward an Extreme-scale Scientific Software Development Kit , 2017, Supercomput. Front. Innov..

[43]  Michael W. Mahoney Boyd,et al.  Randomized Algorithms for Matrices and Data , 2010 .

[44]  Alexander Aiken,et al.  Legion: Expressing locality and independence with logical regions , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[45]  Tzanio V. Kolev,et al.  High order curvilinear finite elements for elastic-plastic Lagrangian dynamics , 2014, J. Comput. Phys..

[46]  Anders Logg,et al.  DOLFIN: Automated finite element computing , 2010, TOMS.

[47]  Leonid Oliker,et al.  Parallel De Bruijn Graph Construction and Traversal for De Novo Genome Assembly , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[48]  Rajeev Thakur,et al.  Test suite for evaluating performance of multithreaded MPI communication , 2009, Parallel Comput..

[49]  Kwan-Liu Ma,et al.  VTK-m: Accelerating the Visualization Toolkit for Massively Threaded Architectures , 2016, IEEE Computer Graphics and Applications.

[50]  Alexander Aiken,et al.  Realm: An event-based low-level runtime for distributed memory architectures , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[51]  S. Ashby,et al.  A parallel multigrid preconditioned conjugate gradient algorithm for groundwater flow simulations , 1996 .

[52]  Smith Barry,et al.  xSDK Community Installation Policies: GNU Autoconf and CMake Options , 2016 .

[53]  Jack J. Dongarra,et al.  Improving the Performance of CA-GMRES on Multicores with Multiple GPUs , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[54]  Prabhat,et al.  Extreme Scaling of Production Visualization Software on Diverse Architectures , 2010, IEEE Computer Graphics and Applications.

[55]  Jungwon Kim,et al.  PapyrusKV: A High-Performance Parallel Key-Value Store for Distributed NVM Architectures , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[56]  Manuel Quezada de Luna,et al.  High-order local maximum principle preserving (MPP) discontinuous Galerkin finite element method for the transport equation , 2017, J. Comput. Phys..

[57]  Utkarsh Ayachit,et al.  The ParaView Guide: A Parallel Visualization Application , 2015 .

[58]  Kwan-Liu Ma,et al.  Finely-Threaded History-Based Topology Computation , 2014, EGPGV@EuroVis.

[59]  James P. Ahrens,et al.  PISTON: A Portable Cross-Platform Framework for Data-Parallel Visualization Operators , 2012, EGPGV@Eurographics.

[60]  Jack Dongarra,et al.  Special Issue on Program Generation, Optimization, and Platform Adaptation , 2005, Proc. IEEE.

[61]  Dan Bonachea GASNet Specification, v1.1 , 2002 .

[62]  Pavan Balaji,et al.  Memory Compression Techniques for Network Address Management in MPI , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[63]  Robert J. Fowler,et al.  Multi-threaded library for many-core systems , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[64]  Kamil Iskra,et al.  Exploring Data Migration for Future Deep-Memory Many-Core Systems , 2016, 2016 IEEE International Conference on Cluster Computing (CLUSTER).

[65]  Mathias Jacquelin,et al.  Highly scalable distributed-memory sparse triangular solution algorithms , 2018, CSC.

[66]  Sayantan Sur,et al.  Why Is MPI So Slow? Analyzing the Fundamental Limits in Implementing MPI-3.1 , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[67]  Simone Atzeni,et al.  SWORD: A Bounded Memory-Overhead Detector of OpenMP Data Races in Production Runs , 2018, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[68]  Martin Schulz,et al.  Systemwide Power Management with Argo , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[69]  Alex Brooks,et al.  Argobots: A Lightweight Low-Level Threading and Tasking Framework , 2018, IEEE Transactions on Parallel and Distributed Systems.

[70]  Laxmikant V. Kalé,et al.  Threads for Interoperable Parallel Programming , 1996, LCPC.

[71]  Jack J. Dongarra,et al.  Power-aware computing: Measurement, control, and performance analysis for Intel Xeon Phi , 2017, 2017 IEEE High Performance Extreme Computing Conference (HPEC).

[72]  Kamil Iskra,et al.  In Situ Workflows at Exascale: System Software to the Rescue , 2017, ISAV@SC.

[73]  Seda Ogrenci Memik,et al.  Minimizing Thermal Variation Across System Components , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.

[74]  Katherine A. Yelick,et al.  UPC++: A PGAS Extension for C++ , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[75]  Edmond Chow,et al.  Iterative Sparse Triangular Solves for Preconditioning , 2015, Euro-Par.

[76]  Kwan-Liu Ma,et al.  A classification of scientific visualization algorithms for massive threading , 2013, UltraVis@SC.

[77]  Robert Latham,et al.  Portable Topology-Aware MPI-I/O , 2017, 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS).

[78]  Alexander S. Szalay,et al.  Extreme Event Analysis in Next Generation Simulation Architectures , 2017, ISC.

[79]  Gregory Becker,et al.  Using Spack to Manage Software on Cray Supercomputers , 2017 .

[80]  Scott B. Baden,et al.  The UPC++ PGAS library for Exascale Computing , 2017, PAW@SC.

[81]  C. C. Law,et al.  ParaView: An End-User Tool for Large-Data Visualization , 2005, The Visualization Handbook.

[82]  Guillaume Mercier,et al.  Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis , 2009, 2009 International Conference on Parallel Processing.

[83]  Michael E. Papka,et al.  Large-Scale Data Visualization Using Parallel Data Streaming , 2001, IEEE Computer Graphics and Applications.

[84]  Akinori Yonezawa,et al.  Fine-grain multithreading with minimal compiler support—a cost effective approach to implementing efficient multithreading languages , 1997, PLDI '97.

[85]  Edmond Chow,et al.  Asynchronous Iterative Algorithm for Computing Incomplete Factorizations on GPUs , 2015, ISC.

[86]  Marvin Theimer,et al.  Cooperative Task Management Without Manual Stack Management , 2002, USENIX Annual Technical Conference, General Track.

[87]  Jack Dongarra,et al.  Roadmap for the Development of a Linear Algebra Library for Exascale Computing: SLATE: Software for Linear Algebra Targeting Exascale , 2017 .

[88]  Michael W. Mahoney Randomized Algorithms for Matrices and Data , 2011, Found. Trends Mach. Learn..

[89]  Samuel Thibault,et al.  A Flexible Thread Scheduler for Hierarchical Multiprocessor Machines , 2005, ArXiv.

[90]  Maya Gokhale,et al.  Argo NodeOS: Toward Unified Resource Management for Exascale , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[91]  Pavan Balaji,et al.  A Performance Study of UCX over InfiniBand , 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).

[92]  Lisa Gerhardt,et al.  Shifter: Containers for HPC , 2017 .

[93]  Mark S. Gordon,et al.  Chapter 41 – Advances in electronic structure theory: GAMESS a decade later , 2005 .

[94]  Muneeb Ali,et al.  Protothreads: simplifying event-driven programming of memory-constrained embedded systems , 2006, SenSys '06.

[95]  Laxmikant V. Kalé,et al.  CHARM++: a portable concurrent object oriented system based on C++ , 1993, OOPSLA '93.

[96]  Victor Alessandrini Intel Threading Building Blocks , 2016 .

[97]  David E. Bernholdt,et al.  OpenMP 4.5 Validation and Verification Suite for Device Offload , 2018, IWOMP.

[98]  Pavan Balaji,et al.  Advanced Thread Synchronization for Multithreaded MPI Implementations , 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).

[99]  Pavan Balaji,et al.  Hexe: A Toolkit for Heterogeneous Memory Management , 2017, 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS).

[100]  Mark Rice,et al.  GridPACK: A Framework for Developing Power Grid Simulations on High Performance Computing Platforms , 2014, 2014 Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing.

[101]  Jungwon Kim,et al.  Design and Implementation of Papyrus: Parallel Aggregate Persistent Storage , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[102]  Song Fu,et al.  F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[103]  Satoshi Matsuoka,et al.  MPI+Threads: runtime contention and remedies , 2015, PPOPP.

[104]  Xiaoye S. Li,et al.  A distributed-memory approximation algorithm for maximum weight perfect bipartite matching , 2018, ArXiv.

[105]  George Bosilca,et al.  Using software-based performance counters to expose low-level open MPI performance information , 2017, EuroMPI/USA.

[106]  George C. Necula,et al.  Capriccio: scalable threads for internet services , 2003, SOSP '03.

[107]  Andrew M. Bradley,et al.  Towards Performance Portability in a Compressible CFD Code , 2017 .

[108]  Sunita Chandrasekaran,et al.  OpenACC 2.5 Validation Testsuite Targeting Multiple Architectures , 2017, ISC Workshops.

[109]  Tzanio V. Kolev,et al.  High-Order Multi-Material ALE Hydrodynamics , 2018, SIAM J. Sci. Comput..

[110]  Kevin T. Pedretti,et al.  Characterizing MPI matching via trace-based simulation , 2017, EuroMPI/USA.

[111]  Raymond Namyst,et al.  MPC: A Unified Parallel Runtime for Clusters of NUMA Machines , 2008, Euro-Par.

[112]  Corporate SunSoft Solaris multithreaded programming guide , 1995 .

[113]  Franck Cappello,et al.  Distributed Monitoring and Management of Exascale Systems in the Argo Project , 2015, DAIS.

[114]  James Reinders,et al.  Intel threading building blocks - outfitting C++ for multi-core processor parallelism , 2007 .

[115]  Maya Gokhale,et al.  A Container-Based Approach to OS Specialization for Exascale Computing , 2015, 2015 IEEE International Conference on Cloud Engineering.

[116]  James P. Ahrens,et al.  The ALPINE In Situ Infrastructure: Ascending from the Ashes of Strawman , 2017, ISAV@SC.

[117]  Vanessa Sochat,et al.  Singularity: Scientific containers for mobility of compute , 2017, PloS one.

[118]  Jack Dongarra,et al.  Designing SLATE: Software for Linear Algebra Targeting Exascale , 2017 .

[119]  Bronis R. de Supinski,et al.  Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[120]  John Shalf,et al.  The International Exascale Software Project roadmap , 2011, Int. J. High Perform. Comput. Appl..

[121]  Utkarsh Ayachit,et al.  ParaView Catalyst: Enabling In Situ Data Analysis and Visualization , 2015, ISAV@SC.

[122]  Pradeep Dubey,et al.  Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort , 2010, SIGMOD Conference.

[123]  Dan Bonachea,et al.  GASNet-EX Performance Improvements Due to Specialization for the Cray Aries Network , 2018, 2018 IEEE/ACM Parallel Applications Workshop, Alternatives To MPI (PAW-ATM).

[124]  Wu-chun Feng,et al.  MPI-ACC: Accelerator-Aware MPI for Scientific Applications , 2016, IEEE Transactions on Parallel and Distributed Systems.

[125]  Tzanio V. Kolev,et al.  High-Order Curvilinear Finite Element Methods for Lagrangian Hydrodynamics , 2012, SIAM J. Sci. Comput..

[126]  William Schroeder,et al.  The Visualization Toolkit: An Object-Oriented Approach to 3-D Graphics , 1997 .

[127]  Robert Sisneros,et al.  EAVL: The Extreme-scale Analysis and Visualization Library , 2012, EGPGV@Eurographics.

[128]  Hal Finkel,et al.  Benchmarking and Evaluating Unified Memory for OpenMP GPU Offloading , 2017, LLVM-HPC@SC.

[129]  Robert D. Falgout,et al.  The Design and Implementation of hypre, a Library of Parallel High Performance Preconditioners , 2006 .

[130]  David M. Beazley,et al.  SWIG: An Easy to Use Tool for Integrating Scripting Languages with C and C++ , 1996, Tcl/Tk Workshop.

[131]  Daniel J. Rader,et al.  Direct simulation Monte Carlo: The quest for speed , 2014 .

[132]  Jesper Larsson Träff,et al.  Exploiting Common Neighborhoods to Optimize MPI Neighborhood Collectives , 2017, 2017 IEEE 24th International Conference on High Performance Computing (HiPC).

[133]  Kenneth Moreland Oh, $#*@! Exascale! The Effect of Emerging Architectures on Scientific Discovery , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.

[134]  Ying Wai Li,et al.  QMCPACK: an open source ab initio quantum Monte Carlo package for the electronic structure of atoms, molecules and solids , 2018, Journal of physics. Condensed matter : an Institute of Physics journal.

[135]  Kenjiro Taura,et al.  MassiveThreads: A Thread Library for High Productivity Languages , 2014, Concurrent Objects and Beyond.

[136]  Sriram Krishnamoorthy,et al.  Work stealing for GPU‐accelerated parallel programs in a global address space framework , 2016, Concurr. Comput. Pract. Exp..

[137]  Jack J. Dongarra,et al.  Investigating power capping toward energy‐efficient scientific applications , 2019, Concurr. Comput. Pract. Exp..

[138]  Arie Shoshani,et al.  Hello ADIOS: the challenges and lessons of developing leadership class I/O frameworks , 2014, Concurr. Comput. Pract. Exp..

[139]  Hank Childs,et al.  VisIt: An End-User Tool for Visualizing and Analyzing Very Large Data , 2011 .

[140]  Ian Briggs,et al.  FLiT: Cross-platform floating-point result-consistency tester and workload , 2017, 2017 IEEE International Symposium on Workload Characterization (IISWC).

[141]  Gokhan Memik,et al.  Addressing Thermal and Performance Variability Issues in Dynamic Processors , 2017 .

[142]  Ralf S. Engelschall Portable Multithreading-The Signal Stack Trick for User-Space Thread Creation , 2000, USENIX Annual Technical Conference, General Track.

[143]  Michael Lang,et al.  NUMA Distance for Heterogeneous Memory , 2017, MCHPC@SC.

[144]  Franck Cappello,et al.  FTI: High performance Fault Tolerance Interface for hybrid systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[145]  Tzanio V. Kolev,et al.  High-order curvilinear finite elements for axisymmetric Lagrangian hydrodynamics , 2013 .

[146]  Jack J. Dongarra,et al.  Incomplete Sparse Approximate Inverses for Parallel Preconditioning , 2018, Parallel Comput..