Performance models for Cluster-enabled OpenMP implementations

A key issue for cluster-enabled OpenMP implementations based on software distributed shared memory (sDSM) systems is maintaining the consistency of the shared memory space. This is the major source of overhead for these systems, and it is driven by the detection and servicing of page faults. This paper investigates how application performance can be modelled from the number of page faults. Two simple models are proposed: one based on the number of page faults along the critical path of the computation, and one based on the aggregate number of page faults. Two different sDSM systems are considered. The models are evaluated using the OpenMP NAS Parallel Benchmarks on an 8-node AMD-based Gigabit Ethernet cluster. Both models gave estimates accurate to within 10% in most cases, with the critical path model slightly more accurate. Accuracy is lost if the underlying page faults cannot be overlapped, or if the application makes extensive use of the OpenMP flush directive.
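The difference between the two page-fault counts can be illustrated with a minimal sketch. The abstract does not state the models' formulas, so everything here is an assumption made for illustration: a linear model (base time plus a fixed cost per page fault), execution split into barrier-delimited regions, and the hypothetical names `critical_path_faults`, `aggregate_faults`, `estimate_time`, `base_seconds`, and `fault_cost_seconds`.

```python
from typing import List

# Hypothetical input: faults[r][t] is the number of page faults taken by
# thread t in barrier-delimited region r. This layout is an assumption.
FaultTable = List[List[int]]


def critical_path_faults(faults: FaultTable) -> int:
    """Sum, over regions, of the largest per-thread fault count.

    Assumes the thread with the most faults in each region determines
    when that region finishes.
    """
    return sum(max(region) for region in faults)


def aggregate_faults(faults: FaultTable) -> float:
    """Total faults over all threads and regions, averaged per thread.

    Assumes fault servicing is spread evenly across threads.
    """
    n_threads = len(faults[0])
    return sum(sum(region) for region in faults) / n_threads


def estimate_time(base_seconds: float, fault_count: float,
                  fault_cost_seconds: float) -> float:
    """Assumed linear model: fault-free time plus a fixed cost per fault."""
    return base_seconds + fault_count * fault_cost_seconds


if __name__ == "__main__":
    # Made-up counts for 4 threads over 2 regions (not measured data).
    table = [[10, 40, 20, 10], [30, 30, 30, 30]]
    for name, count in (("critical path", critical_path_faults(table)),
                        ("aggregate", aggregate_faults(table))):
        print(name, estimate_time(2.0, count, 0.001))
```

With an imbalanced region such as the first one above, the critical path count exceeds the per-thread average, so the two estimates diverge. When faults are evenly distributed, the two counts coincide.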
