Improving software pipelining with hardware support for self-spatial loads

Recent work on software pipelining in the presence of uncertain memory latencies has shown that using compiler-generated cache-reuse analysis to determine proper load latencies can improve performance significantly [14, 19, 9]. Even with reuse information, references with a stride-one access pattern in the cache (called self-spatial loads) have been treated as all cache hits or all cache misses, rather than as a single cache miss followed by a few cache hits in the rest of the cache line. In this paper, we show how hardware support for loading two consecutive cache lines with one instruction (called a prefetching load), when directed by the compiler, can significantly improve software pipelining for scientific program loops. On a set of 79 Fortran loops, prefetching loads yielded an average performance improvement of 7% over assuming that all self-spatial loads are cache misses (assuming all hits often gives worse performance than assuming all misses [14]). In addition, prefetching loads reduced floating-point register pressure by 31% and integer register pressure by 20%. As a result, we were able to software pipeline 31% more loops within modern register constraints (32 integer/32 floating-point registers) with prefetching loads. These results show that specialized prefetching load instructions have considerable potential to improve software pipelining for array-based scientific codes.
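The gap between the three latency models for a self-spatial load stream can be sketched as follows. The latency values, cache-line size, and element size below are illustrative assumptions for exposition, not the machine parameters used in the paper; the point is only that a stride-one stream incurs one miss per cache line, not one miss (or one hit) per reference.

```python
# Illustrative parameters (assumptions, not the paper's machine model).
MISS_LATENCY = 10      # cycles charged for a cache miss
HIT_LATENCY = 1        # cycles charged for a cache hit
ELEMS_PER_LINE = 8     # e.g. 8-byte doubles in a 64-byte cache line

def total_load_cycles(n_iters: int, model: str) -> int:
    """Total load latency for a stride-one (self-spatial) reference
    stream of n_iters references under a given latency model."""
    if model == "all-miss":
        # Conservative model: every reference charged a full miss.
        return n_iters * MISS_LATENCY
    if model == "all-hit":
        # Optimistic model: every reference charged a hit.
        return n_iters * HIT_LATENCY
    if model == "one-miss-per-line":
        # Self-spatial behavior: first reference to each cache line
        # misses; the remaining references to that line hit.
        misses = -(-n_iters // ELEMS_PER_LINE)  # ceiling division
        return misses * MISS_LATENCY + (n_iters - misses) * HIT_LATENCY
    raise ValueError(f"unknown model: {model}")

for model in ("all-miss", "all-hit", "one-miss-per-line"):
    print(model, total_load_cycles(64, model))
# → all-miss 640, all-hit 64, one-miss-per-line 136
```

Under these assumed parameters, the mixed model charges roughly a fifth of the cycles the all-miss model does; a modulo scheduler using the all-miss latency must hide far more latency per reference, which inflates the schedule and the register pressure the abstract reports.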

[1]  M. Rajagopalan,et al.  Software Pipelining: Petri Net Pacemaker , 1993, Architectures and Compilation Techniques for Fine and Medium Grain Parallelism.

[2]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[3]  Krishna Subramanian,et al.  Enhanced modulo scheduling for loops with conditional branches , 1992, MICRO 25.

[4]  A. Gonzalez, et al.  Cache sensitive modulo scheduling , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[5]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[6]  D.A. Reed,et al.  An Integrated Compilation and Performance Analysis Environment for Data Parallel Programs , 1995, Proceedings of the IEEE/ACM SC95 Conference.

[7]  F. Jesús Sánchez,et al.  Cache Sensitive Modulo Scheduling , 1997, MICRO.

[8]  Keith D. Cooper,et al.  Effective partial redundancy elimination , 1994, PLDI '94.

[9]  Keith D. Cooper,et al.  Operator strength reduction , 2001, TOPL.

[10]  Vicki H. Allan,et al.  Software pipelining , 1995, CSUR.

[11]  Keith D. Cooper,et al.  Value Numbering , 1997, Softw. Pract. Exp..

[12]  Philip H. Sweany,et al.  Modulo Scheduling with Cache Reuse Information , 1997, Euro-Par.

[13]  Ken Kennedy,et al.  Scalar replacement in the presence of conditional control flow , 1994, Softw. Pract. Exp..

[14]  Alexander Aiken,et al.  Optimal loop parallelization , 1988, PLDI '88.

[15]  Scott A. Mahlke,et al.  Reverse If-Conversion , 1993, PLDI '93.

[16]  Jean-Loup Baer,et al.  Reducing memory latency via non-blocking and prefetching caches , 1992, ASPLOS V.

[17]  Ken Kennedy,et al.  Software prefetching , 1991, ASPLOS IV.