On Effective Execution of Nonuniform DOACROSS Loops

DOACROSS loops with nonuniform loop-carried dependences are extremely difficult to parallelize. In this paper, we present a static scheduling scheme and an accompanying synchronization strategy that execute such loops effectively and efficiently. Our approach builds on Dependence Uniformization, a parallelization technique that finds a small set of uniform dependence vectors covering all possible nonuniform dependences in a DOACROSS loop. It differs from previous schemes in that we demonstrate a better way to select the uniform dependence vectors. When used with the Static Strip Scheduling scheme, the proposed uniform dependence vector set lets us enforce dependences with more locality, which considerably reduces the need for explicit synchronization while retaining most of the parallelism. This paper describes the uniform dependence vector selection strategy and the static strip scheduling scheme. A performance analysis and examples are also presented.
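To make the idea of covering nonuniform dependences with a small uniform set concrete, the following is a minimal sketch and not the selection strategy proposed in this paper. It assumes a hypothetical two-level loop nest that writes `A[2*i + j]` and reads `A[i + 2*j]`, enumerates the dependence distance vectors by brute force, and then picks a two-vector set of the form {(0, 1), (1, -t)} whose non-negative integer combinations reach every observed distance. All names (`distance_vectors`, `uniform_pair`, `covered`) and the choice of basis shape are assumptions made for illustration.

```python
from itertools import product

def distance_vectors(n, write_idx, read_idx):
    """Brute-force the loop-carried dependence distances of an n x n nest.

    Two distinct iterations depend on each other when one writes the
    element the other reads. The distance is (later - earlier) in
    sequential (lexicographic) execution order.
    """
    iters = list(product(range(n), repeat=2))
    dists = set()
    for p in iters:
        for q in iters:
            if p != q and write_idx(*p) == read_idx(*q):
                src, dst = (p, q) if p < q else (q, p)  # tuple order is lexicographic
                dists.add((dst[0] - src[0], dst[1] - src[1]))
    return dists

def uniform_pair(dists):
    """Pick the smallest t >= 0 such that (0, 1) and (1, -t) cover every distance."""
    t = 0
    for x, y in dists:
        if x > 0:
            t = max(t, -(y // x))  # ceil(-y / x) for x > 0
    return [(0, 1), (1, -t)], t

def covered(d, t):
    """d = x*(1, -t) + (y + x*t)*(0, 1) needs both coefficients non-negative."""
    x, y = d
    return x >= 0 and y + x * t >= 0

# Hypothetical loop body: A[2*i + j] = ... A[i + 2*j] ...
dists = distance_vectors(4, lambda i, j: 2 * i + j, lambda i, j: i + 2 * j)
basis, t = uniform_pair(dists)
all_covered = all(covered(d, t) for d in dists)
```

Enforcing only the two vectors in `basis` is conservative: it orders more iteration pairs than the original dependences require, which is why the choice of uniform vectors affects how much parallelism and locality survive.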
