Bridging Processor and Memory Performance in ILP Processors via Data-Remapping

Current system design trends continue to magnify the disparity between processor and memory performance. As microprocessors increasingly outperform the memory systems supporting them, it is ever more important to bridge this gap so that the promise of Moore’s law translates into performance delivered to end applications. The gap is further exacerbated in modern processors with high levels of instruction-level parallelism (ILP), especially for data-intensive applications. In these processors, increased demands for data delivery lead to concomitant needs for higher memory bandwidth and larger caches. In this paper we provide a fast compile-time data-remapping technique that helps bridge the gap between the ILP processor and its memory system by enhancing the spatial locality of data accesses. Our strategy is the first automatic approach applicable to pointer-intensive dynamic applications, for which existing optimizations are mostly inadequate. We demonstrate an average performance improvement of 27% for several data-intensive applications. We attribute this improvement to enhanced data locality, which lowers the demand on bandwidth between cache levels as well as between the cache subsystem and main memory. We also show that, with increasing levels of ILP and fixed memory bandwidth, our remapping technique enables very high levels of performance with smaller caches. For example, multi-level cache sizes can be reduced by as much as a factor of 15 without a loss in performance. Although we use cycle-accurate simulators to detail the benefits of our remapping, we also measure performance improvements of 24% on the Intel Pentium II and III processors and 9% on the Sun UltraSparc-II.
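To make the spatial-locality argument concrete, the following is a minimal sketch of why grouping the same field from many records into contiguous memory reduces the number of cache lines touched. It is an illustrative address model only, not the remapping algorithm of this paper; the 64-byte line size, 32-byte record, 4-byte field, and all function names are assumptions chosen for the example.

```python
CACHE_LINE = 64  # bytes; assumed cache-line size for this illustration


def lines_touched(addresses, line_size=CACHE_LINE):
    """Count the distinct cache lines covered by a sequence of byte addresses."""
    return len({addr // line_size for addr in addresses})


def original_layout(n, record_size, field_offset):
    """Addresses of one field when n whole records are stored back to back."""
    return [i * record_size + field_offset for i in range(n)]


def remapped_layout(n, field_size):
    """Addresses of the same field once it is stored contiguously across records."""
    return [i * field_size for i in range(n)]


n = 1024  # hypothetical number of records traversed
before = lines_touched(original_layout(n, record_size=32, field_offset=8))
after = lines_touched(remapped_layout(n, field_size=4))
print(before, after)  # 512 64
```

Under these assumed sizes, a traversal that reads a single field touches one eighth as many cache lines after remapping, which is the kind of reduction in cache and bandwidth demand the abstract refers to.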
