Locality Optimizations for Parallel Computing Using Data Access Information

Given the large communication overheads characteristic of modern parallel machines, optimizations that improve locality by executing tasks close to the data they will access may improve the performance of parallel computations. This paper describes our experience automatically applying locality optimizations in the context of Jade, a portable, implicitly parallel programming language designed for exploiting task-level concurrency. Jade programmers start with a program written in a standard serial, imperative language, then use Jade constructs to declare how parts of the program access data. The Jade implementation uses this data access information to automatically extract the concurrency and apply locality optimizations. We present performance results for several Jade applications running on the Stanford DASH machine and use these results to characterize the overall performance impact of the locality optimizations. In our application set, the locality optimization level had little effect on the performance of two of the applications and a large effect on the performance of the rest. We also found that, when the locality optimization level had a significant effect on performance, the maximum performance was obtained when the programmer explicitly placed tasks on processors rather than relying on the scheduling algorithm inside the Jade implementation.
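To make the programming model concrete, the sketch below (not code from this paper) illustrates the flavor of a Jade task declaration, assuming Jade's `withonly ... do` construct and its `rd`/`wr` access declaration statements as described in the Jade language papers; the objects `a` and `result` and the reduction computed here are hypothetical.

```c
/* Hedged sketch of a Jade-style task, not taken from this paper.
 * The withonly section declares which shared objects the task body
 * will read (rd) or write (wr); the do section is ordinary serial C
 * that becomes the task body.  The implementation uses these
 * declarations both to extract concurrency (tasks with conflicting
 * declarations execute in serial order) and to schedule the task
 * close to the data it will access.
 */
#define N 1024

double *a;        /* assumed to be Jade shared objects */
double *result;

withonly {
    rd(a);        /* this task only reads a          */
    wr(result);   /* ...and only writes result       */
} do (a, result) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += a[i];
    *result = sum;
}
```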