Feasibility of decoupling memory management from the execution pipeline

In conventional architectures, the central processing unit (CPU) spends a significant amount of execution time allocating and de-allocating memory. Efforts to improve memory management functions using custom allocators have led to only small improvements in performance. In this work, we test the feasibility of decoupling memory management functions from the main processing element to a separate memory management hardware. Such memory management hardware can reside on the same die as the CPU, in a memory controller or embedded within a DRAM chip. Using Simplescalar, we simulated our architecture and investigated the execution performance of various benchmarks selected from SPECInt2000, Olden and other memory intensive application suites. Hardware allocator reduced the execution time of applications by as much as 50%. In fact, the decoupled hardware results in a performance improvement even when we assume that both the hardware and software memory allocators require the same number of cycles. We attribute much of this improved performance to improved cache behavior since decoupling memory management functions reduces cache pollution caused by dynamic memory management software. We anticipate that even higher levels of performance can be achieved by using innovative hardware and software optimizations. We do not show any specific implementation for the memory management hardware. This paper only investigates the potential performance gains that can result from a hardware allocator.

[1]  K. M. Kavi,et al.  Intelligent Memory Manager Eliminates Cache Pollution Due to Memory Management Functions , 2003 .

[2]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[3]  Mohamed Shalan,et al.  A dynamic memory management unit for embedded real-time system-on-a-chip , 2000, CASES '00.

[4]  Emery D. Berger,et al.  A locality-improving dynamic memory allocator , 2005, MSP '05.

[5]  Sadiq M. Sait,et al.  A high-performance hardware-efficient memory allocation technique and design , 1999, Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors (Cat. No.99CB37040).

[6]  Krishna M. Kavi,et al.  Intelligent memory manager: Reducing cache pollution due to memory management functions , 2006, J. Syst. Archit..

[7]  Dean M. Tullsen,et al.  Simultaneous multithreading: Maximizing on-chip parallelism , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[8]  Ron K. Cytron,et al.  Hardware Support for Fast and Bounded-Time Storage Allocation , 2002 .

[9]  Gurindar S. Sohi,et al.  Program Demultiplexing: Data-flow based Speculative Parallelization of Methods in Sequential Programs , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[10]  Paul R. Wilson,et al.  Dynamic Storage Allocation: A Survey and Critical Review , 1995, IWMM.

[11]  Olivier Temam,et al.  Load squared: adding logic close to memory to reduce the latency of indirect loads with high miss ratios , 2005, MEDEA '04.

[12]  Katherine Yelick,et al.  A Case for Intelligent RAM: IRAM , 1997 .

[13]  Christoforos E. Kozyrakis,et al.  A case for intelligent RAM , 1997, IEEE Micro.

[14]  Kathryn S. McKinley,et al.  Reconsidering custom memory allocation , 2002, OOPSLA '02.

[15]  J. Morris Chang,et al.  A High-Performance Memory Allocator for Object-Oriented Systems , 1996, IEEE Trans. Computers.