Paper information - Determining the idle time of a tiling

Determining the idle time of a tiling

This paper investigates the idle time associated with a parallel computation, that is, the time that processors are idle because they are either waiting for data from other processors or waiting to synchronize with other processors. We study doubly-nested loops corresponding to parallelogram- or trapezoidal-shaped iteration spaces that have been parallelized by the well-known tiling transformation. We introduce the notion of rise r, which relates the shape of the iteration space to that of the tiles. For parallelogram-shaped iteration spaces, we show that when r < -2, the idle time is linear in P, the number of processors, but when r > -1, it is quadratic in P. In the context of hierarchical tiling, where multiple levels of tiling are used, a good choice of rise can lead to less idle time and better performance. While idle time is not the only cost that should be considered in evaluating a tiling strategy, current architectural trends (of deeper memory hierarchies and multiple levels of parallelism) suggest it has increasing importance.
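
To make the notion of idle time concrete, the following minimal sketch simulates a toy tiled computation. It is not the paper's model: it assumes a rectangular grid of unit-time tiles, one column of tiles per processor, and a dependence of each tile on the previous tile of the same processor and on the matching tile of the left neighbour. The function name `simulate_idle_time` and its parameters are hypothetical, and the rise r is not modelled because the abstract does not define it.

```python
def simulate_idle_time(num_procs, tiles_per_proc):
    """Toy model (an assumption, not the paper's): unit-time tiles,
    one column of tiles per processor.

    Tile (p, k) can start only after tile (p, k-1) on the same processor
    and tile (p-1, k) on the left neighbour have finished.

    Returns (makespan, total_idle), where total_idle counts every time
    step in which a processor is not computing before the last tile ends.
    """
    finish = [[0] * tiles_per_proc for _ in range(num_procs)]
    for p in range(num_procs):
        for k in range(tiles_per_proc):
            ready_own = finish[p][k - 1] if k > 0 else 0
            ready_left = finish[p - 1][k] if p > 0 else 0
            # A tile starts when both inputs are ready and takes one time unit.
            finish[p][k] = max(ready_own, ready_left) + 1

    makespan = finish[-1][-1]
    busy = num_procs * tiles_per_proc          # total useful work
    total_idle = num_procs * makespan - busy   # everything else is waiting
    return makespan, total_idle


if __name__ == "__main__":
    for procs in (2, 4, 8, 16):
        makespan, idle = simulate_idle_time(procs, tiles_per_proc=10)
        print(procs, makespan, idle)
```

In this toy model tile (p, k) finishes at time p + k + 1, so the total idle time works out to P(P - 1): quadratic in P, and independent of the number of tiles per processor. The abstract's result is that for parallelogram-shaped iteration spaces the growth can be linear or quadratic in P depending on the rise.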
