Improving the Scalability of Parallel Jobs by Adding Parallel Awareness to the Operating System

A parallel application benefits from scheduling policies that take a global view of the application's process working set. As the interactions among cooperating processes increase, mechanisms that reduce waiting within one or more of the processes become more important. In particular, collective operations such as barriers and reductions are extremely sensitive to events that are usually harmless, such as context switches among members of the process working set. For the last 18 months, we have been researching the impact of random short-lived interruptions, such as timer-decrement processing and periodic daemon activity, and developing strategies to minimize their impact on SPMD bulk-synchronous programs running at large processor counts. We present a novel co-scheduling scheme for improving the performance of fine-grain collective activities such as barriers and reductions, describe an implementation consisting of operating system kernel modifications and a run-time system, and present a set of empirical results comparing the technique with traditional operating system scheduling. Our results indicate a speedup of over 300% on synchronizing collectives.
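The sensitivity of collectives to short interruptions can be illustrated with a minimal sketch. This is not the paper's model. It assumes, purely for illustration, that each process is interrupted independently with probability p during a given synchronization interval. A barrier completes only when the slowest process arrives, so it is delayed whenever any one process is interrupted. The function name and the parameter values below are hypothetical.

```python
def p_barrier_delayed(n_procs: int, p_interrupt: float) -> float:
    """Probability that at least one of n_procs processes is interrupted.

    Assumption (illustrative only): interruptions are independent across
    processes and each occurs with probability p_interrupt per interval.
    """
    return 1.0 - (1.0 - p_interrupt) ** n_procs


if __name__ == "__main__":
    # Hypothetical per-process interruption probability of 0.1% per interval.
    for n in (1, 16, 256, 4096):
        print(n, round(p_barrier_delayed(n, 0.001), 4))
```

Under the same assumption, if interruptions were aligned so that all processes were interrupted in the same intervals (the intuition behind co-scheduling), a barrier would be delayed with probability p regardless of the process count.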
