The Influence of Operating Systems on the Performance of Collective Operations at Extreme Scale

We investigate operating system noise, which we identify as one of the main reasons for a lack of synchronicity in parallel applications. Using a microbenchmark, we measure the noise on several contemporary platforms and find that, even with a general-purpose operating system, noise can be limited if certain precautions are taken. We then inject artificially generated noise into a massively parallel system and measure its influence on the performance of collective operations. Our experiments indicate that on extreme-scale platforms, the performance is correlated with the largest interruption to the application, even if the probability of such an interruption is extremely small. We demonstrate that synchronizing the noise can significantly reduce its negative influence.
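
The scaling argument in the abstract can be illustrated with a toy model. The sketch below is not the paper's benchmark or its noise injector: it assumes a bulk-synchronous loop in which every process computes for `work` microseconds and then enters a barrier, and in which each process is independently interrupted with probability `p_noise` per iteration for `noise_len` microseconds. All names and parameter values are hypothetical. Under these assumptions the barrier completes only when the slowest process arrives, so one interrupted process delays all of them. In the synchronized case, every process is assumed to be interrupted at the same moment, so a single draw per iteration decides whether the whole machine pauses.

```python
import random


def mean_iteration_time(n_procs, n_iters, work, p_noise, noise_len,
                        synchronized, seed=0):
    """Mean time per compute-plus-barrier iteration in a toy noise model.

    Assumptions (illustrative only): each interruption has the fixed length
    noise_len, interruptions occur with probability p_noise per process per
    iteration, and the barrier itself costs nothing.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_iters):
        if synchronized:
            # Co-scheduled noise: all processes pause together, so one
            # draw decides the delay for the whole iteration.
            hit = rng.random() < p_noise
        else:
            # Independent noise: the barrier waits for the slowest process,
            # so one interrupted process is enough to delay everyone.
            hit = any(rng.random() < p_noise for _ in range(n_procs))
        total += work + (noise_len if hit else 0.0)
    return total / n_iters


if __name__ == "__main__":
    # Hypothetical parameters: 100 us of work, rare 1000 us interruptions.
    for n in (16, 1024, 16384):
        unsync = mean_iteration_time(n, 500, 100.0, 0.001, 1000.0, False)
        sync = mean_iteration_time(n, 500, 100.0, 0.001, 1000.0, True)
        print(f"procs={n:6d}  unsynchronized={unsync:8.1f} us  "
              f"synchronized={sync:8.1f} us")
```

In this model, the chance that at least one of N processes is interrupted in a given iteration is 1 - (1 - p)^N, which approaches 1 as N grows. The unsynchronized iteration time therefore tends toward `work + noise_len`, that is, toward the largest interruption, however rare that interruption is on any single process. The synchronized case stays near `work + p_noise * noise_len`, independent of N.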
