Operating systems and runtime environments on supercomputers

The evolving architecture of high-performance computing (HPC) nodes poses new challenges to the low-level system software that runs on them. New multicore and many-core designs strain system scalability; the introduction of nonvolatile memory forces a rethink of the memory and storage hierarchy in extreme-scale systems; and the overall increase in component count requires careful consideration of power management and fault tolerance approaches.

The International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), held in cooperation with the ACM Special Interest Group on High Performance Computing (SIGHPC), provides a forum for researchers to exchange ideas and discuss research questions relevant to low-level system software on current and future-generation supercomputers. This special issue presents significantly extended versions of the best papers from the 2013 edition of the workshop, which was held in conjunction with the ACM International Conference on Supercomputing (ICS) in Eugene, Oregon, USA, on June 10, 2013.

In his keynote address, "Characteristics of Adaptive Runtime Systems in HPC," Sanjay Kale examined the features that are necessary or desirable to empower adaptive runtime systems, using Charm++ and the Adaptive Message Passing Interface (MPI) as examples. Nine regular papers and one invited talk were also presented at the workshop, in three sessions: exascale and energy, memory and tasking, and big data.

Levy et al., in their paper "A Study of the Viability of Exploiting Memory Content Similarity to Improve Resilience to Memory Errors," examine memory snapshots from HPC operating systems and applications and propose a novel runtime that transparently exploits memory content similarity to reduce the rate of fatal memory errors. In "Collective I/O under Memory Constraints," Lu et al. introduce a new collective I/O strategy to cope with the decreasing amount of memory available per CPU core. Their strategy considers capacity and bandwidth constraints, restricts aggregation data traffic, coordinates I/O accesses, and chooses aggregators at run time.

These research efforts identify major challenges facing OS and runtime system developers and propose promising solutions to address them. We hope that the principles and techniques presented in this special issue will play an important role in the design and development of specialized operating systems for the exascale era.