MARS: A Distributed Memory Approach to Shared Memory Compilation

This paper describes MARS, an automatic parallelising compiler targeted at shared memory machines. It uses a data partitioning approach, traditionally used for distributed memory machines, to globally reduce overheads such as communication and synchronisation. Its high-level linear algebraic representation allows the direct application of, for instance, unimodular transformations and the global application of data transformations. Although a data-based approach allows global analysis and in many instances outperforms local, loop-oriented parallelisation approaches, we have identified two particular problems when applying data parallelism to sequential Fortran 77, as opposed to data parallel dialects tailored to distributed memory targets. This paper describes two techniques to overcome these problems and evaluates their applicability. Preliminary results on two SPECfp92 benchmarks show that, with these optimisations, MARS outperforms existing state-of-the-art loop-based auto-parallelisation approaches.
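
As a minimal illustration (not taken from the paper itself), a unimodular transformation such as loop interchange on a doubly nested loop can be expressed in a linear algebraic framework by applying an integer matrix T with |det T| = 1 to the iteration vector:

\[
T = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad
T \begin{pmatrix} i \\ j \end{pmatrix} = \begin{pmatrix} j \\ i \end{pmatrix}
\]

The transformed loop bounds and array subscripts are then obtained by applying T (and its inverse) to the original bound and access matrices; this is the kind of direct matrix manipulation that a high-level linear algebraic representation permits.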
