Using reconfigurable hardware to customize memory hierarchies

Over the past decade or more, processor speeds have increased much more quickly than memory speeds. As a result, a large, and still increasing, processor-memory performance gap has formed. Many significant applications suffer from substantial memory bottlenecks, and their memory performance problems are often either too unusual or too extreme to be mitigated by cache memories alone. Such specialized performance 'bugs' require specialized solutions, but it is impossible to provide case-by-case memory hierarchies or caching strategies on general-purpose computers. We have investigated the potential of implementing mechanisms like victim caches and prefetch buffers in reconfigurable hardware to improve application memory behavior. Based on technology and commercial trends, our simulation-based studies use a forward-looking model in which configurable logic is located on the CPU chip. Given such assumptions, our results show that the flexibility of being able to specialize configurable hardware to an application's memory referencing behavior more than balances the slightly slower response times of configurable memory hierarchy structures. For our three applications, small, specialized memory hierarchy additions such as victim caches and prefetch buffers can reduce miss rates substantially and can drop total execution times for these programs to between 60% and 80% of their original execution times. Our results also indicate that different memory specializations may be most effective for each application; this highlights the usefulness of configurable memory hierarchies that are specialized on a per-application basis.
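To make the victim-cache mechanism mentioned above concrete, the following is a minimal illustrative sketch, not the simulator used in this work. It models a direct-mapped cache backed by an optional small, fully associative victim cache with LRU replacement. The function name `simulate` and the parameters `num_sets`, `block_size`, and `victim_entries` are hypothetical choices made for illustration, and the model counts only misses, not timing.

```python
from collections import OrderedDict


def simulate(addresses, num_sets=4, block_size=16, victim_entries=0):
    """Count misses for a direct-mapped cache with an optional victim cache.

    All sizes are hypothetical; this is a toy model, not a timing simulator.
    """
    cache = [None] * num_sets   # block number currently held by each set
    victim = OrderedDict()      # fully associative victim buffer, oldest first
    misses = 0
    for addr in addresses:
        block = addr // block_size
        index = block % num_sets
        if cache[index] == block:
            continue            # hit in the direct-mapped cache
        evicted = cache[index]
        if block in victim:
            del victim[block]   # victim-cache hit: block moves back to the main cache
        else:
            misses += 1         # miss in both structures: fetch from the next level
        cache[index] = block
        if victim_entries and evicted is not None:
            victim[evicted] = True              # displaced block goes to the victim cache
            if len(victim) > victim_entries:
                victim.popitem(last=False)      # drop the least recently inserted entry
    return misses


# Two addresses that map to the same set and therefore evict each other.
trace = [0, 64] * 10
print(simulate(trace))                     # 20: every access is a conflict miss
print(simulate(trace, victim_entries=1))   # 2: only the two cold misses remain
```

In this toy trace a single victim entry removes all conflict misses. Real applications behave less cleanly, which is consistent with the observation above that the most effective specialization differs from one application to another.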
