Porting and Developing Parallel Applications

This chapter describes how to port and develop parallel applications. Depending on the machine architecture, several tasks must be completed to obtain an efficient parallel version of an application. On machines where each processor has its own local memory, laying out data across the processor array is perhaps the most complex and least intuitive task for programmers with a sequential background. On shared-memory machines, data layout and access patterns can also be important, but for a different reason: the program must make efficient use of each processor's local cache. Programs that keep their memory references tightly clustered in a small number of memory regions use the cache much more efficiently than programs whose references are widely dispersed.

The goal of automatic parallelization is to relieve the programmer of as much responsibility as possible for these parallelization tasks. In the rosiest scenario, the programmer need not worry about these matters at all: the compiler would accept dusty deck codes and produce efficient parallel object code without any additional work by the programmer.
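To make the earlier point about cache use concrete, here is a minimal C sketch of two loops that compute the same sum over the same array but touch memory in different orders. The function names `sum_row_major` and `sum_col_major` and the sizes `ROWS` and `COLS` are hypothetical, chosen only for illustration; they do not come from any particular code discussed in this chapter.

```c
#include <stddef.h>

/* Illustrative sizes only (an assumption for this sketch). */
#define ROWS 512
#define COLS 512

/*
 * The array is stored row by row in one contiguous block, so element
 * (i, j) lives at index i * COLS + j.
 */

/* Clustered references: the inner loop visits consecutive addresses,
 * so each cache line that is fetched is fully used before moving on. */
double sum_row_major(const double *a)
{
    double total = 0.0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            total += a[i * COLS + j];
    return total;
}

/* Dispersed references: the inner loop jumps COLS elements between
 * consecutive accesses, so successive references usually fall in
 * different cache lines. */
double sum_col_major(const double *a)
{
    double total = 0.0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            total += a[i * COLS + j];
    return total;
}
```

Both functions add up the same elements; only the traversal order differs. On typical cached hardware the first form tends to run faster once the array is larger than the cache, although the size of the difference depends on the machine and on whether the compiler interchanges the loops on its own.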