Versatile software-defined HPC and cloud clusters on Alps supercomputer for diverse workflows

For the past few decades, supercomputers have driven innovations in performance and scaling that benefit several scientific applications. Yet their ecosystems remain virtually unchanged when it comes to integrating distributed data-driven workflows, primarily due to rather rigid access methods and restricted configuration management options. The X-as-a-Service model of cloud computing has introduced, among other features, a developer-centric DevOps approach that empowers developers of artefacts ranging from infrastructure and platform to software, which contemporary supercomputers unfortunately still lack. We introduce vClusters (versatile software-defined clusters), which are based on Infrastructure-as-Code (IaC) technology. The vClusters approach is a unique fusion of HPC and cloud technologies, resulting in a software-defined, multi-tenant cluster on a supercomputing ecosystem that, together with software-defined storage, enables DevOps for complex, data-driven workflows such as grid middleware, alongside a classic HPC platform. IaC has been commonplace in cloud computing; however, it has lacked adoption within multi-petascale ecosystems due to concerns about performance and interoperability with the ecosystems of classic HPC data centres. We present an overview of the Swiss National Supercomputing Centre’s flagship Alps ecosystem as an implementation target for vClusters for HPC and data-driven workflows. Alps is based on the Cray-HPE Shasta EX supercomputing platform, which includes an IaC-compliant, microservices architecture (MSA) management system that we leverage to demonstrate vClusters usage for our diverse operational workflows. We provide implementation details of two operational vClusters platforms: a classic HPC platform used predominantly by hundreds of users running thousands of large-scale numerical simulation batch jobs; and a widely used, data-intensive Grid computing middleware platform used for CERN Worldwide LHC Computing Grid (WLCG) operations.
The resulting solution showcases reuse and reduction of common configuration recipes across vCluster implementations, minimising operational change management overheads while introducing flexibility for managing artefacts for DevOps required by diverse workflows.
