Compiling for the Impulse memory controller

The Impulse memory controller provides an interface for remapping irregular or sparse memory accesses into dense accesses in cache. This capability significantly increases processor cache and system bus utilization, and previous work reports performance improvements by factors of 1.2 to 5 for hand-coded kernels in a cycle-level simulator using current technology models. For any specialized hardware feature to attain widespread use, a compiler must automate its use. We present compiler cost models, built on dependence and locality analysis, that determine when using Impulse will improve performance. The models weigh three quantities: the reduction in misses, the additional cost of misses in Impulse, and the fixed cost of setting up a remapping. We implement the cost models and generate the appropriate Impulse system calls in the Scale compiler framework. Our results demonstrate that the cost models correctly choose when to use Impulse and when not to. We also combine and compare Impulse with our implementation of loop permutation for improving locality. If loop permutation can achieve the same dense access pattern as Impulse, we prefer it, since it has no overheads, but we show that the combination can yield better performance.
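The three-way trade-off described above can be illustrated with a minimal sketch. This is not the paper's actual cost model: the linear form, the field names, and the function names (`remap_benefit`, `choose_strategy`) are assumptions made for illustration, and the sketch does not model combining loop permutation with Impulse.

```python
from dataclasses import dataclass


@dataclass
class LoopEstimate:
    """Hypothetical per-loop estimates a compiler might derive from locality analysis."""
    misses_original: int        # estimated cache misses without remapping
    misses_remapped: int        # estimated cache misses with dense, remapped accesses
    miss_cost: float            # cycles per ordinary cache miss
    remapped_miss_extra: float  # extra cycles per miss served through the remapping
    setup_cost: float           # fixed cycles to set up one remapping


def remap_benefit(e: LoopEstimate) -> float:
    """Estimated cycles saved by remapping (negative means a slowdown)."""
    baseline = e.misses_original * e.miss_cost
    remapped = (e.misses_remapped * (e.miss_cost + e.remapped_miss_extra)
                + e.setup_cost)
    return baseline - remapped


def choose_strategy(e: LoopEstimate, permutation_gives_dense: bool) -> str:
    """Prefer loop permutation when it yields the same dense pattern, since it has no overhead."""
    if permutation_gives_dense:
        return "permute"
    return "impulse" if remap_benefit(e) > 0 else "none"


# Assumed numbers for illustration only: a long strided loop versus a very short one.
long_loop = LoopEstimate(1_000_000, 125_000, 100.0, 40.0, 50_000.0)
short_loop = LoopEstimate(100, 13, 100.0, 40.0, 50_000.0)
```

With these assumed numbers, the long loop amortizes the setup cost and the sketch selects remapping, while the short loop does not and is left unchanged.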
