Performance Implications of Synchronization Support for Parallel Fortran Programs

This paper studies the performance implications of architectural synchronization support for automatically parallelized numerical programs. As the basis for this work, the authors analyze the synchronization needs of such programs, which arise from task management, loop scheduling, barriers, and data dependence handling. They present synchronization algorithms for efficient execution of programs with nested parallel loops. Next, they identify how various hardware synchronization primitives can be used to satisfy these software synchronization needs. The primitives studied are test-and-set, fetch-and-add, exchange-byte, and lock/unlock operations implemented with a synchronization bus. Lastly, they run experiments to quantify the impact of each form of architectural support on the performance of a bus-based shared-memory multiprocessor running automatically parallelized numerical programs. They find that supporting an atomic fetch-and-add primitive in shared memory is as effective as supporting lock/unlock operations with a synchronization bus. Both achieve substantial performance improvement over the cases where only atomic test-and-set or exchange-byte operations are supported in shared memory.
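To make the contrast between the primitives concrete, the following is a minimal sketch of loop self-scheduling, one of the synchronization needs listed above, written once with fetch-and-add and once with a test-and-set spin lock. It is not the authors' algorithm. The function names, the `hits` bookkeeping vector, and the use of C++ `std::atomic` in place of hardware instructions are assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: `workers` threads claim iterations 0..n-1 of a
// parallel loop by a single atomic fetch-and-add on a shared counter.
// Returns how many times each iteration was executed.
std::vector<int> self_schedule_fetch_add(int n, int workers) {
    std::atomic<int> next{0};
    std::vector<int> hits(n, 0);
    auto body = [&] {
        for (;;) {
            // One atomic operation both reads and advances the counter.
            int i = next.fetch_add(1, std::memory_order_relaxed);
            if (i >= n) break;
            hits[i] += 1;  // stand-in for the loop body; index i is private
        }
    };
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) pool.emplace_back(body);
    for (auto& t : pool) t.join();
    return hits;
}

// Hypothetical helper: the same scheduling, but the counter is a plain
// integer guarded by a spin lock built from test-and-set.
std::vector<int> self_schedule_test_and_set(int n, int workers) {
    std::atomic_flag lock = ATOMIC_FLAG_INIT;
    int next = 0;
    std::vector<int> hits(n, 0);
    auto body = [&] {
        for (;;) {
            // Spin until test-and-set reports the lock was previously free.
            while (lock.test_and_set(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
            int i = next++;                         // critical section
            lock.clear(std::memory_order_release);  // unlock
            if (i >= n) break;
            hits[i] += 1;
        }
    };
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) pool.emplace_back(body);
    for (auto& t : pool) t.join();
    return hits;
}
```

Both versions run every iteration exactly once. The fetch-and-add version needs one atomic operation per claimed iteration, while the test-and-set version needs a lock acquisition, an update, and a release, with contending processors retrying in the meantime. That difference is consistent with the direction of the reported results, though this sketch does not reproduce the paper's measurements.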
