Distributed Shared Memory and Compiler-Induced Scalable Locality for Scalable Cluster Performance

Distributed shared memory software allows a cluster to function as a single collection of many processing cores with a large physical memory, but with highly unusual performance characteristics: communication latency and bandwidth between nodes may be several orders of magnitude worse than on-chip. Effective use of such systems therefore requires computation/communication ratios many times higher than those that suffice on a single node. The loop optimization known as "time skewing" or "time tiling" can, for some codes, produce arbitrarily high compute balance. In principle, it should thus allow scalable high performance regardless of memory and network bandwidth limitations. We have been exploring the scalability of time tiling on homogeneous dedicated clusters, considering the effects of scaling both the number of nodes in the cluster and the ratio of computation speed to network bandwidth. Even for simple one- and two-dimensional Jacobi stencil computations, there are practical challenges in realizing the predicted scalability.
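As an illustration of the transformation, the following is a minimal single-node sketch of time skewing applied to a one-dimensional three-point Jacobi stencil. It is not the implementation studied here: the function names, the averaging stencil, the fixed boundary values, and the choice to store every time step in full are all assumptions made for clarity. The loop nest is tiled along the skewed coordinate s = i + t, so each tile advances through all time steps before the next tile begins.

```python
def jacobi_naive(u0, steps):
    """Reference version: sweep the whole array once per time step.

    Assumes len(u0) >= 2; the two end points are held fixed.
    """
    n = len(u0)
    grid = [list(u0)] + [[0.0] * n for _ in range(steps)]
    for t in range(1, steps + 1):
        grid[t][0] = u0[0]
        grid[t][n - 1] = u0[n - 1]
        for i in range(1, n - 1):
            grid[t][i] = (grid[t - 1][i - 1] + grid[t - 1][i]
                          + grid[t - 1][i + 1]) / 3.0
    return grid[steps]


def jacobi_time_skewed(u0, steps, tile_width):
    """Time-skewed version (hypothetical sketch, same assumptions as above).

    Point (t, i) reads (t-1, i-1), (t-1, i), (t-1, i+1). In the skewed
    coordinate s = i + t these lie at s-2, s-1 and s on the previous time
    step, so no point depends on a larger s. Tiles of s can therefore be
    executed left to right, each one running through every time step.
    """
    n = len(u0)
    # Full space-time storage keeps the sketch simple; a practical code
    # would keep only the values that are still live.
    grid = [list(u0)] + [[0.0] * n for _ in range(steps)]
    for t in range(1, steps + 1):
        grid[t][0] = u0[0]
        grid[t][n - 1] = u0[n - 1]

    s_end = n - 1 + steps  # one past the largest s = (n - 2) + steps
    for s_lo in range(2, s_end, tile_width):
        s_hi = min(s_lo + tile_width, s_end)
        for t in range(1, steps + 1):
            # Convert the tile's s range back to interior indices i = s - t.
            i_lo = max(1, s_lo - t)
            i_hi = min(n - 1, s_hi - t)
            for i in range(i_lo, i_hi):
                grid[t][i] = (grid[t - 1][i - 1] + grid[t - 1][i]
                              + grid[t - 1][i + 1]) / 3.0
    return grid[steps]
```

Each tile performs roughly tile_width * steps updates while introducing only about tile_width new initial values, which is why the ratio of computation to data movement can, in principle, be raised by increasing the number of time steps per tile. Both functions apply the same arithmetic to the same inputs for every point, so their results should agree exactly.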