Improving parallelism and locality with asynchronous algorithms

As multicore chips become the main building blocks for high performance computers, many numerical applications face a performance impediment due to the limited hardware capacity to move data between the CPU and the off-chip memory. This is especially true for large computing problems solved by iterative algorithms because of the large data set typically used. Loop tiling, also known as loop blocking, was shown previously to be an effective way to enhance data locality, and hence to reduce the memory bandwidth pressure, for a class of iterative algorithms executed on a single processor. Unfortunately, the tiled programs suffer from reduced parallelism because only the loop iterations within a single tile can be easily parallelized. In this work, we propose to use the asynchronous model to enable effective loop tiling such that both parallelism and locality can be attained simultaneously. Asynchronous algorithms were previously proposed to reduce the communication cost and synchronization overhead between processors. Our new discovery is that carefully controlled asynchrony and loop tiling can significantly improve the performance of parallel iterative algorithms on multicore processors due to simultaneously attained data locality and loop-level parallelism. We present supporting evidence from experiments with three well-known numerical kernels.

[1]  Gérard M. Baudet,et al.  Asynchronous Iterative Methods for Multiprocessors , 1978, JACM.

[2]  John N. Tsitsiklis,et al.  Convergence rate and termination of asynchronous iterative algorithms , 1989, ICS '89.

[3]  D. Szyld,et al.  Asynchronous two-stage iterative methods , 1994 .

[4]  William Pugh,et al.  Exploiting Monotone Convergence Functions in Parallel Programs , 1996, LCPC.

[5]  Matteo Frigo,et al.  Cache-oblivious algorithms , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[6]  Yuan Shi,et al.  Timing Models and Local Stopping Criteria for Asynchronous Iterative Algorithms , 1999, J. Parallel Distributed Comput..

[7]  Zhiyuan Li,et al.  New tiling techniques to improve cache temporal locality , 1999, PLDI '99.

[8]  Craig C. Douglas,et al.  A Tutorial on Elliptic Pde Solvers and Their Parallelization , 2003 .

[9]  Charles E. Leiserson,et al.  Cache-Oblivious Algorithms , 2003, CIAC.

[10]  Jingling Xue,et al.  Code tiling for improving the cache performance of PDE solvers , 2003, 2003 International Conference on Parallel Processing, 2003. Proceedings..

[11]  David A. Padua,et al.  Programming for parallelism and locality with hierarchically tiled arrays , 2006, PPoPP '06.

[12]  Sadaf R. Alam,et al.  Characterization of Scientific Workloads on Systems with Multi-Core Processors , 2006, 2006 IEEE International Symposium on Workload Characterization.

[13]  Volker Strumpen,et al.  The memory behavior of cache oblivious stencil computations , 2007, The Journal of Supercomputing.

[14]  Samuel Williams,et al.  Optimization of sparse matrix-vector multiplication on emerging multicore platforms , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[15]  Sanjay V. Rajopadhye,et al.  Towards Optimal Multi-level Tiling for Stencil Computations , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[16]  Yuan Shi,et al.  Parallel Processing of Linear Systems Using Asynchronous Iterative Algorithms , 2007 .

[17]  Zhiyuan Li,et al.  ASYNC Loop Constructs for Relaxed Synchronization , 2008, LCPC.

[18]  Lixia Liu,et al.  Analyzing memory access intensity in parallel programs on multicore , 2008, ICS '08.

[19]  Uday Bondhugula,et al.  A practical automatic polyhedral parallelizer and locality optimizer , 2008, PLDI '08.

[20]  Richard W. Vuduc,et al.  Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems , 2009, ICS.