CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE

In the early 1990s, researchers at Sandia National Laboratories and the University of New Mexico began development of customized system software for massively parallel ‘capability’ computing platforms. These lightweight kernels have proven to be essential for delivering the full power of the underlying hardware to applications. This claim is underscored by the success of several supercomputers, including the Intel Paragon, Intel Accelerated Strategic Computing Initiative Red, and the Cray XT series of systems, each having established a new standard for high‐performance computing upon introduction. In this paper, we describe our approach to lightweight compute node kernel design and discuss the design principles that have guided several generations of implementation and deployment. A broad strategy of operating system specialization has led to a focus on user‐level resource management, deterministic behavior, and scalable system services. The relative importance of each of these areas has changed over the years in response to changes in applications, hardware, and system architecture. We detail our approach and the associated principles, describe how our application of these principles has changed over time, and provide design and performance comparisons to contemporaneous supercomputing operating systems. Copyright © 2008 John Wiley & Sons, Ltd.
