Invited Paper: A Compile-time Cost Model for OpenMP

OpenMP has gained wide popularity as an API for parallel programming on shared memory and distributed shared memory platforms. It is also a promising candidate to exploit the emerging multicore, multithreaded processors. In addition, there is an increasing trend to combine OpenMP with MPI to take full advantage of mainstream supercomputers consisting of clustered SMPs. All of these require that attention be paid to the quality of the compiler's translation of OpenMP and the flexibility of runtime support. Many compilers and runtime libraries have an internal cost model that helps evaluate compiler transformations, guides adaptive runtime systems, and helps achieve load balancing. But existing models are not sufficient to support OpenMP, especially on new platforms. In this paper we present our experience adapting the cost models in OpenUH, a branch of Open64, to estimate the execution cycles of parallel OpenMP regions using knowledge of both software and hardware. Our OpenMP cost model reuses major components from Open64, along with extensions to consider more OpenMP details. Preliminary evaluations of the model are presented using kernel benchmarks. The challenges and possible extensions for modeling OpenMP on multicore platforms are also discussed.

[1]  Alejandro Duran,et al.  Dynamic load balancing of MPI+OpenMP applications , 2004, International Conference on Parallel Processing, 2004. ICPP 2004..

[2]  A. Snavely,et al.  Modeling application performance by convolving machine signatures with application profiles , 2001, Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization. WWC-4 (Cat. No.01EX538).

[3]  Mary K. Vernon,et al.  Parallel program performance prediction using deterministic task graph analysis , 2004, TOCS.

[4]  Kathryn S. McKinley,et al.  A Compiler Optimization Algorithm for Shared-Memory Multiprocessors , 1998, IEEE Trans. Parallel Distributed Syst..

[5]  Sharad Malik,et al.  Precise miss analysis for program transformations with caches of arbitrary associativity , 1998, ASPLOS VIII.

[6]  Dieter an Mey,et al.  Hybrid Parallelization of CFD Applications with Dynamic Thread Balancing , 2004, PARA.

[7]  Mary K. Vernon,et al.  Analytic evaluation of shared-memory systems with ILP processors , 1998, ISCA.

[8]  Ramesh Subramonian,et al.  LogP: towards a realistic model of parallel computation , 1993, PPOPP '93.

[9]  Mihai Burcea,et al.  An Adaptive OpenMP Loop Scheduler for Hyperthreaded SMPs , 2004, PDCS.

[10]  Vivek Sarkar,et al.  On Estimating and Enhancing Cache Effectiveness , 1991, LCPC.

[11]  Ko-Yang Wang Precise compile-time performance prediction for superscalar-based computers , 1994, PLDI '94.

[12]  Mary K. Vernon,et al.  Poems: end-to-end performance design of large parallel adaptive computational systems , 1998, WOSP '98.

[13]  J. M. Bull,et al.  Measuring Synchronisation and Scheduling Overheads in OpenMP , 2007 .

[14]  Peter Naur IFIP Working Group on ALGOL , 1962 .

[15]  Barbara M. Chapman,et al.  Evaluating OpenMP on Chip MultiThreading Platforms , 2005, IWOMP.

[16]  Amer Diwan,et al.  SUIF Explorer: an interactive and interprocedural parallelizer , 1999, PPoPP '99.

[17]  Ruoming Jin,et al.  A methodology for detailed performance modeling of reduction computations on SMP machines , 2005, Perform. Evaluation.

[18]  Ruoming Jin,et al.  Performance prediction for random write reductions: a case study in modeling shared memory programs , 2002, SIGMETRICS '02.

[19]  Yan Solihin,et al.  Predicting inter-thread cache contention on a chip multi-processor architecture , 2005, 11th International Symposium on High-Performance Computer Architecture.

[20]  Bruce M. Maggs,et al.  Proceedings of the 28th Annual Hawaii International Conference on System Sciences- 1995 Models of Parallel Computation: A Survey and Synthesis , 2022 .

[21]  Wenguang Chen,et al.  OpenUH: an optimizing, portable OpenMP compiler , 2007, Concurr. Comput. Pract. Exp..

[22]  Carl Staelin,et al.  lmbench: Portable Tools for Performance Analysis , 1996, USENIX Annual Technical Conference.

[23]  Michael Voss,et al.  Reducing parallel overheads through dynamic serialization , 1999, Proceedings 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IPPS/SPDP 1999.

[24]  Michael E. Wolf,et al.  Combining Loop Transformations Considering Caches and Scheduling , 2004, International Journal of Parallel Programming.

[25]  John M. Mellor-Crummey,et al.  Cross-architecture performance predictions for scientific applications using parameterized models , 2004, SIGMETRICS '04/Performance '04.

[26]  Alan Jay Smith,et al.  Analysis of benchmark characteristics and benchmark performance prediction , 1996, TOCS.

[27]  Gang Ren,et al.  A comparison of empirical and model-driven optimization , 2003, PLDI '03.

[28]  Dingxing Wang,et al.  ORC-OpenMP: An OpenMP Compiler Based on ORC , 2004, International Conference on Computational Science.

[29]  Israel Koren,et al.  An analytical model of high performance superscalar-based multiprocessors , 1995, PACT.