HPC-Colony: services and interfaces for very large systems

Traditional full-featured operating systems are known to have properties that limit the scalability of distributed-memory parallel programs, the most common programming paradigm in high-end computing. Furthermore, as processor counts grow on the most capable systems, the activity needed to manage the system becomes an increasing burden. Making a general-purpose operating system scale to such levels requires new technology for parallel resource management and global system management (including fault management). In this paper, we describe the shortcomings of full-featured operating systems and runtime systems, and discuss an approach for scaling such systems to one hundred thousand processors with both scalable parallel application performance and efficient system management.
