Argo NodeOS: Toward Unified Resource Management for Exascale

Exascale systems are expected to feature hundreds of thousands of compute nodes with hundreds of hardware threads and complex memory hierarchies with a mix of on-package and persistent memory modules. In this context, the Argo project is developing a new operating system for exascale machines. Targeting production workloads that use workflows or coupled codes, we improve the Linux kernel on several fronts. We extend the memory management of Linux to be able to subdivide NUMA memory nodes, allowing better resource partitioning among processes running on the same node. We also add support for memory-mapped access to node-local, PCIe-attached NVRAM devices and introduce a new scheduling class targeted at parallel runtimes supporting user-level load balancing. These features are unified into compute containers, a containerization approach focused on providing modern HPC applications with dynamic control over a wide range of kernel interfaces. To keep our approach compatible with industrial containerization products, we also identify contention points for the adoption of containers in HPC settings. Each NodeOS feature is evaluated using a set of parallel benchmarks, miniapps, and coupled applications consisting of simulation and data analysis components, running on a modern NUMA platform. We observe out-of-the-box performance improvements easily matching, and often exceeding, those observed with expert-optimized configurations on standard OS kernels. Our lightweight approach to resource management retains the many benefits of a full OS kernel that application programmers have learned to depend on, while providing a set of extensions that can be freely mixed and matched to best benefit particular application components.

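To make the resource-partitioning idea concrete, the following is a minimal sketch, not the Argo NodeOS interface, of how a process on a stock Linux kernel can be confined to a subset of CPUs and NUMA memory nodes through a cgroup-v1 cpuset; compute containers build dynamic, per-component control on top of this kind of mechanism. It assumes the cpuset controller is mounted at /sys/fs/cgroup/cpuset, that the caller has permission to create child cgroups there, and uses a hypothetical cgroup name argo_demo.

/* Sketch: confine the calling process to CPUs 0-3 and NUMA memory node 0
 * via a Linux cgroup-v1 cpuset.  Illustrative only; not the NodeOS API. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(EXIT_FAILURE); }
    if (fputs(value, f) == EOF) { perror(path); exit(EXIT_FAILURE); }
    fclose(f);
}

int main(void)
{
    const char *dir = "/sys/fs/cgroup/cpuset/argo_demo";  /* hypothetical name */
    char path[256], pid[32];

    /* Create the child cpuset; tolerate it already existing on repeated runs. */
    if (mkdir(dir, 0755) && errno != EEXIST) { perror(dir); return EXIT_FAILURE; }

    /* Restrict the cpuset to CPUs 0-3 and NUMA memory node 0. */
    snprintf(path, sizeof(path), "%s/cpuset.cpus", dir);
    write_file(path, "0-3");
    snprintf(path, sizeof(path), "%s/cpuset.mems", dir);
    write_file(path, "0");

    /* Move the calling process into the cpuset; its scheduling and memory
     * allocations are confined to the chosen CPUs and memory node from here on. */
    snprintf(path, sizeof(path), "%s/tasks", dir);
    snprintf(pid, sizeof(pid), "%d", getpid());
    write_file(path, pid);

    return EXIT_SUCCESS;
}

A simulation component and an in situ analysis component sharing a node could each be placed in such a cpuset with disjoint CPU and memory-node lists; the NUMA-node subdivision described in the abstract goes further by allowing partitions finer than a hardware memory node.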