Performance transformations for irregular applications

Irregular computations occur in several important science and engineering applications, including molecular dynamics simulations, n-body problems, and finite element analysis. The observed performance of such irregular applications is typically cited as 10% or less of the advertised peak performance of current computer architectures. These applications frequently use compact sparse matrix formats, which introduce non-affine memory references such as A[B[i]]. The resulting memory access patterns cannot be determined at compile time, so the static analysis required by compile-time transformations is inhibited. Therefore, run-time data and computation reordering transformations are needed to improve data locality and exploit parallelism, both of which are essential for improved program performance on current computer architectures. Run-time reordering transformations are implemented with inspectors and executors. The inspector traverses the memory reference pattern at run time, generates data and computation reordering functions based on the observed pattern, creates new schedules, and remaps affected data structures accordingly. The executor is a transformed version of the original program that uses the schedules and remapped data structures generated by the inspector. Two challenges are inherent in using run-time reordering transformations: amortizing the overhead of the inspector, and automatically composing and generating the inspector/executor code.
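The inspector/executor split can be illustrated with a short Python sketch for a loop that reads A[B[i]]. This is a minimal illustration under stated assumptions, not the implementation developed in this dissertation: the function names inspector, remap, and executor are hypothetical, and the reordering shown is a simple first-touch data reordering chosen for brevity, not full sparse tiling. Python lists will not exhibit the cache behavior that motivates the transformation; the sketch only shows how the pieces fit together.

```python
def inspector(index, n):
    """Traverse the index array once and build a data reordering function.

    Returns sigma, where sigma[old] is the new location of data element old.
    Elements are placed in the order the loop first touches them
    (a first-touch reordering, assumed here purely for illustration).
    """
    sigma = [None] * n
    next_free = 0
    for old in index:
        if sigma[old] is None:
            sigma[old] = next_free
            next_free += 1
    # Elements the loop never references are appended in their original order.
    for old in range(n):
        if sigma[old] is None:
            sigma[old] = next_free
            next_free += 1
    return sigma


def remap(data, index, sigma):
    """Apply the reordering to the data array and to the index array."""
    new_data = [None] * len(data)
    for old, new in enumerate(sigma):
        new_data[new] = data[old]
    new_index = [sigma[old] for old in index]
    return new_data, new_index


def executor(data, index):
    """The original computation: a reduction over data[index[i]]."""
    total = 0.0
    for i in range(len(index)):
        total += data[index[i]]
    return total


A = [10.0, 20.0, 30.0, 40.0, 50.0]
B = [3, 0, 3, 4, 0]

sigma = inspector(B, len(A))          # run once, at run time
A_new, B_new = remap(A, B, sigma)     # remap affected data structures
result = executor(A_new, B_new)       # same loop, reordered data
```

The executor is unchanged apart from the arrays it is handed, and it produces the same result as running the loop on the original A and B. In a setting where the loop runs many times over the same index array, the one-time cost of inspector and remap is what must be amortized.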
This dissertation makes three main contributions: the development of a run-time reordering transformation for data locality and parallelism called full sparse tiling; a description of technical and software engineering issues that occur when incorporating run-time reordering transformations such as full sparse tiling into existing software packages; and a framework for composing full sparse tiling and other run-time reordering transformations at compile-time. Our results show that on scientific computing benchmarks, compositions of run-time reordering transformations can result in significant performance improvements in serial and parallel processing environments, and that it is possible to amortize the overhead of the inspector.