Automatic Code Partitioning for Distributed-Memory Multiprocessors (DMMs)

The advances in hardware design of parallel computers have not been matched by corresponding advances in the software used to program these machines. This is especially true for Distributed Memory Multiprocessors (DMMs), in which there is no shared memory accessible to all the Processing Elements (PEs). High-level programming abstractions for these machines are almost non-existent, leaving programmers the task of explicitly programming these architectures using machine-dependent, low-level abstractions. This approach is error-prone and forces the programmer to deal with many details outside the application domain. More precisely, the programmer must handle all the parallel processing tasks required to program the parallel machine: explicit partitioning of the program code into parallel tasks, scheduling those tasks on the PEs, synchronization, explicit distribution of data among the PEs, and insertion of the message-passing calls needed to move data from one remote memory to another. Much effort is being devoted to having the compiler of the parallel machine perform these tasks automatically. This way, users need not know the details of the machine's architecture; their main concern is specifying the algorithm that solves the problem. Two of the main phases of such a compiler are the code partitioning and scheduling phases. Many solutions have been proposed for the scheduling phase; much more remains to be done for the code partitioning phase. Most existing work on the partitioning problem either considers a specific application and tries to devise an efficient partitioning scheme for it (i.e., no automatic partitioning), or proposes a general solution (automatic partitioning) that is too simple and therefore inefficient (e.g., one that exploits only a single level of parallelism).
Our research addresses the code partitioning phase of the compiler. We propose a data-flow-based partitioning method in which all levels of parallelism are exploited. Given a Directed Acyclic Graph (DAG) representation of the program, we propose a procedure that automatically determines the granularity of parallelism by partitioning the graph into tasks to be scheduled on the DMM. The granularity of parallelism depends only on the program to be executed and on the parameters of the target machine. The output of our algorithm is passed as input to the scheduling phase. Finding an optimal partition is NP-complete, and because graph algorithms are expensive, near-optimal solutions tend to carry prohibitively high cost (higher-order polynomial). We therefore propose heuristics that give good performance at relatively low cost.
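To make the idea of granularity-driven partitioning concrete, the following is a minimal sketch, not the heuristic proposed in this work. It assumes a hypothetical merge criterion: two nodes joined by a data-flow edge are packed into the same task whenever the cost of communicating that edge between PEs (modeled from two target-machine parameters, a per-message latency and a per-byte transfer cost) exceeds the smaller of the two nodes' computation costs. The function name `partition_dag` and all parameters are illustrative inventions.

```python
from collections import defaultdict

def partition_dag(edges, comp, latency, per_byte):
    """Greedy grain-packing sketch for DAG partitioning (illustrative only).

    edges    : list of (u, v, nbytes) data-flow edges of the program DAG
    comp     : dict mapping each node to its computation cost
    latency  : per-message start-up cost of the target DMM (assumed parameter)
    per_byte : per-byte transfer cost of the target DMM (assumed parameter)

    Returns a list of tasks, each a list of DAG nodes. Granularity thus
    depends only on the program (edges, comp) and the machine parameters.
    """
    # Union-find over nodes: each disjoint set is one task of the partition.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Examine the heaviest edges first so the most expensive
    # inter-PE messages are eliminated early.
    for u, v, nbytes in sorted(edges, key=lambda e: -e[2]):
        comm = latency + per_byte * nbytes
        # Hypothetical criterion: merge when communicating the edge
        # costs more than the parallelism lost by serializing the nodes.
        if comm > min(comp[u], comp[v]) and find(u) != find(v):
            parent[find(u)] = find(v)

    tasks = defaultdict(list)
    for n in comp:
        tasks[find(n)].append(n)
    return list(tasks.values())
```

For example, with a diamond-shaped DAG in which the a→b and b→d edges carry large messages while a→c and c→d carry small ones, the sketch packs a, b, and d into one coarse task and leaves c as a separate task, so the cheap edges remain available for parallel scheduling:

```python
edges = [("a", "b", 4096), ("a", "c", 8), ("b", "d", 4096), ("c", "d", 8)]
comp = {"a": 100, "b": 100, "c": 100, "d": 100}
tasks = partition_dag(edges, comp, latency=50, per_byte=0.5)
```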