Modernizing the HPC System Software Stack

Through the 1990s, HPC centers at national laboratories, universities, and other large sites designed distributed system architectures and software stacks that enabled extreme-scale computing. By the 2010s, these centers had been eclipsed in scale by web-scale and cloud computing architectures, and today even upcoming exascale HPC systems are orders of magnitude smaller than the datacenters operated by large web companies. Meanwhile, the HPC community has allowed system software designs to stagnate, relying on incremental changes to tried-and-true designs to move from one generation of systems to the next. We contend that a modern system software stack that focuses on manageability, scalability, security, and modern methods will benefit the entire HPC community. In this paper, we break down the logical parts of a typical HPC system software stack, examine more modern ways to meet their needs, and recommend future work that would help the community move in that direction.
