jitSim: A Simulator for Predicting Scalability of Parallel Applications in Presence of OS Jitter

Operating system (OS) jitter has traditionally been a source of performance degradation for parallel applications running on a large number of processors. While some large-scale HPC systems, such as Blue Gene/L and Cray XT4, mitigate jitter by using a specialized lightweight operating system on compute nodes, other clusters have attempted to use HPC-ready commodity operating systems such as ZeptoOS (based on Linux). However, as large systems continue to be designed to work with commodity OSes, OS jitter remains an active area of research within the HPC community. While it is true that some specialized commodity OSes like ZeptoOS have relatively low OS jitter levels, there is still a need for a quick and easy set of tools that can predict the impact of OS jitter for a given configuration and processor count. Such tools are also required to validate and compare any new techniques or OS enhancements that mitigate jitter. Emulating jitter on a large "jitter-free" platform, using either synthetic jitter or real traces from commodity OSes, has been proposed as one useful mechanism for studying scalability behavior in the presence of jitter. However, this requires access to large-scale jitter-free systems, which are few in number and not easily accessible. As new systems are built that should scale up to a million tasks and more, the emulation approach remains limited by the largest jitter-free system available. In this paper we present jitSim, a simulation framework for predicting the scalability of parallel compute-intensive applications in the presence of OS jitter using trace-driven simulation. The jitter simulation framework can be used to quickly simulate the effects of jitter that is characteristic of a given OS, using a trace collected from that OS. Furthermore, it can be used to predict scalability up to arbitrarily large task counts.
Our methodology comprises the collection of real jitter traces and the measurement of network latency, message passing stack latency, and shared memory latency. The simulation framework takes these as inputs and then simulates multiple parallel tasks, each starting at a randomly chosen point in the jitter trace and executing a compute phase. We validate the simulation results by comparing them with real data, and we demonstrate the efficacy of the simulation framework by evaluating various jitter mitigation techniques through simulation.
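The core idea of the methodology can be illustrated with a minimal sketch. It is not the jitSim implementation: the trace format (a sorted list of non-overlapping `(start_time, duration)` jitter events, replayed cyclically), the function names `compute_phase_time` and `simulate_phase`, and the choice to model synchronization as a zero-cost barrier are all assumptions made for illustration. The measured network, message passing stack, and shared memory latencies that the framework takes as inputs are omitted here.

```python
import bisect
import random


def compute_phase_time(trace, trace_len, start, work):
    """Wall-clock time needed to finish `work` units of CPU time.

    `trace` is a hypothetical format: (start_time, duration) jitter events,
    sorted by start_time, non-overlapping, all ending within `trace_len`.
    The trace is replayed cyclically. Simplification: an event that began
    before `start` is ignored even if it is still in progress.
    """
    starts = [s for s, _ in trace]
    pos = start % trace_len              # current position inside the trace
    i = bisect.bisect_left(starts, pos)  # first event at or after pos
    elapsed = 0.0
    remaining = work
    while True:
        if i == len(trace):              # no more events: run to end, wrap
            gap = trace_len - pos
            if remaining <= gap:
                return elapsed + remaining
            elapsed += gap
            remaining -= gap
            pos, i = 0.0, 0
            continue
        event_start, event_dur = trace[i]
        gap = event_start - pos          # jitter-free time before next event
        if remaining <= gap:
            return elapsed + remaining
        elapsed += gap + event_dur       # compute, then get preempted
        remaining -= gap
        pos = event_start + event_dur
        i += 1


def simulate_phase(trace, trace_len, num_tasks, work, rng):
    """One compute phase ended by a barrier: the slowest task sets the time.

    Each task starts at an independent, uniformly random trace offset.
    The barrier itself is assumed to cost nothing.
    """
    return max(
        compute_phase_time(trace, trace_len, rng.uniform(0.0, trace_len), work)
        for _ in range(num_tasks)
    )


if __name__ == "__main__":
    # Made-up trace: two jitter events in a window of 100 time units.
    demo_trace = [(10.0, 2.0), (50.0, 5.0)]
    rng = random.Random(0)
    for n in (1, 16, 1024):
        print(n, simulate_phase(demo_trace, 100.0, n, 20.0, rng))
```

Because the phase time is the maximum over all tasks, the chance that at least one task hits a long jitter event grows with the task count, which is the scalability effect such a simulation is meant to expose.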
