Runtime loop optimizations for locality and parallelism

The complexity of contemporary high-performance computers, with their deep memory hierarchies and multiple processors, makes it essential for system optimizations to help programmers make effective use of all the resources. Without such optimizations, simple programs written in the more intuitive high-level constructs of a programming language are unlikely to realize the full potential of high-performance computers. These optimizations are normally thought of as a compile-time task, because the compiler can analyze the source code to unravel application-specific information that is not available at architecture design time. However, optimizations performed strictly at compile time lack information about runtime conditions, such as the values of variables and workload imbalances in a multiprocessor. Without knowing the runtime conditions, a compiler must make conservative assumptions to ensure the correctness of the program under all possible conditions, and this limits the kinds of optimizations it can make.

This thesis presents a runtime loop optimization that outperforms existing optimizations for locality and parallelism. We focus on loops because loops account for the majority of the computational time in scientific and engineering applications. While loop restructuring for locality and loop scheduling for parallelism are conventionally treated as two distinct optimizations, we present a single runtime method for both. Furthermore, the optimization is not limited to a single perfectly nested loop; the runtime techniques presented here apply to more complex loops and loop structures.

Given that the compiler is rich in application-specific knowledge and the runtime system is rich in knowledge of the runtime environment, we take the view that the compiler should describe what the computation is, while the runtime system should listen and execute, adapting to dynamic changes during the execution of the program. This requires that the compiler communicate application-specific information to the runtime system and that the runtime system maintain data structures to hold this information during execution. The static description and runtime representation of loops is the fundamental challenge this thesis addresses.

We have designed a runtime system called DUDE (Def-Use Descriptor Environment) in which the compiler or the programmer describes the static loop structure and the dependences between iterations in the iteration spaces. The runtime system uses this information to effect what we call a dependence-driven execution. In a dependence-driven execution, we begin with a pool of work consisting of unconstrained loop iterations. As processors pick up chunks of iterations from the work queues and execute an operation on them, their completion can enable new iterations whose dependences, as specified by the compiler, are now satisfied. Dependence-driven execution has the desirable property that local actions have global effects. This reduces global communication (improving locality) and global synchronization (improving parallelism). The same property also leads to better performance in multiprogrammed environments. Furthermore, the Def-Use Descriptor Environment is an elegant way to achieve both task and data parallelism.
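To make the mechanism concrete, the following is a minimal C++ sketch of dependence-driven execution under simplifying assumptions: each chunk of iterations carries a counter of unsatisfied dependences, and a chunk enters the ready queue only when that counter reaches zero. The names (Chunk, DependenceDrivenPool) and the counter-based representation are illustrative assumptions for exposition, not DUDE's actual interface.

    // Sketch of dependence-driven execution: chunks become runnable
    // when their dependence counters reach zero. Illustrative only.
    #include <atomic>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Chunk {
        std::function<void()> body;      // iterations to execute
        std::atomic<int> unmet_deps{0};  // dependences not yet satisfied
        std::vector<Chunk*> successors;  // chunks this one can enable
    };

    class DependenceDrivenPool {
    public:
        // Register a chunk; unconstrained chunks seed the ready queue.
        void add(Chunk* c) {
            ++remaining_;
            if (c->unmet_deps.load() == 0) {
                std::lock_guard<std::mutex> g(m_);
                ready_.push_back(c);
            }
        }
        void run(int nthreads) {
            std::vector<std::thread> ts;
            for (int t = 0; t < nthreads; ++t)
                ts.emplace_back([this] { worker(); });
            for (auto& t : ts) t.join();
        }
    private:
        void worker() {
            while (remaining_.load() > 0) {
                Chunk* c = nullptr;
                {
                    std::lock_guard<std::mutex> g(m_);
                    if (!ready_.empty()) {
                        c = ready_.front();
                        ready_.pop_front();
                    }
                }
                if (!c) { std::this_thread::yield(); continue; }
                c->body();  // execute this chunk's iterations
                // Local action, global effect: finishing a chunk may
                // enable its successors, with no global barrier.
                for (Chunk* s : c->successors)
                    if (s->unmet_deps.fetch_sub(1) == 1) {
                        std::lock_guard<std::mutex> g(m_);
                        ready_.push_back(s);
                    }
                --remaining_;
            }
        }
        std::deque<Chunk*> ready_;
        std::mutex m_;
        std::atomic<int> remaining_{0};
    };

In this sketch, a chunk whose counter is initialized to its in-degree in the dependence graph stays out of the ready queue until its last predecessor finishes; the only synchronization a worker performs is notifying its successors, so independent regions of the iteration space proceed without any global barrier.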