Compile-time techniques for efficient utilization of parallel memories

Partitioning shared memory into multiple memory modules is a common approach to achieving high memory bandwidth for parallel processors. Memory access conflicts occur when several processors simultaneously request data from the same memory module. Although work has been done to improve access performance for vectors, no work has been reported on improving the access performance of scalars. For systems in which the processors operate in lock-step mode, a large percentage of memory access conflicts can be predicted at compile-time. These conflicts can be avoided by an appropriate distribution of data among the memory modules at compile-time. A long instruction word machine is an example of such a system: its functional units operate in lock-step, performing operations on data fetched in parallel from multiple memory modules. In this paper, compile-time techniques for distributing scalars to avoid memory access conflicts are presented. Furthermore, algorithms are developed to schedule data transfers among memory modules, resolving the conflicts that cannot be avoided by the distribution of values alone. The techniques have been implemented as part of a compiler for a reconfigurable long instruction word architecture. Experimental results demonstrate that a very high percentage of memory access conflicts can be avoided while scheduling only a small number of data transfers.
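The paper's own distribution algorithms are not reproduced here, but the core idea of conflict-avoiding scalar placement can be sketched as a graph-coloring problem: scalars fetched in the same long-instruction-word cycle conflict, and the compiler assigns conflicting scalars to different memory modules where possible. The following is an illustrative sketch (function names, the greedy ordering, and the tie-breaking rule are this sketch's assumptions, not the paper's method):

```python
def assign_modules(simultaneous_accesses, num_modules):
    """Greedily map scalar names to memory modules.

    simultaneous_accesses: list of sets, each set naming the scalars
    fetched in one lock-step cycle.  Scalars in the same set conflict.
    Returns a dict {scalar: module_index}.
    """
    # Build the conflict graph: an edge joins two scalars that are
    # accessed in the same cycle and so must sit in different modules.
    conflicts = {}
    for group in simultaneous_accesses:
        for v in group:
            conflicts.setdefault(v, set()).update(group - {v})

    assignment = {}
    # Color highest-degree scalars first (a standard greedy heuristic).
    for v in sorted(conflicts, key=lambda s: -len(conflicts[s])):
        used = {assignment[n] for n in conflicts[v] if n in assignment}
        free = [m for m in range(num_modules) if m not in used]
        if free:
            assignment[v] = free[0]
        else:
            # No conflict-free module exists: a residual conflict remains.
            # The paper resolves such cases by scheduling data transfers
            # among modules; here we just pick the least-loaded module.
            assignment[v] = min(
                range(num_modules),
                key=lambda m: sum(assignment.get(n) == m
                                  for n in conflicts[v]))
    return assignment

# Three scalars, pairwise conflicting, on a 3-module machine:
print(assign_modules([{"a", "b"}, {"a", "c"}, {"b", "c"}], 3))
```

With three modules the triangle of conflicts is fully resolved (each scalar lands in its own module); with only two modules one conflict necessarily survives, which is exactly the situation the paper's transfer-scheduling algorithms address.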
