Intelligent memory manager: towards improving the locality behavior of allocation-intensive applications

The dynamic memory management required by allocation-intensive applications (i.e., object-oriented programs and programs built on linked data structures) has motivated a large body of research. Cache misses in these applications cost ever more execution cycles as the CPU-memory speed gap continues to grow. Sophisticated prefetching techniques, data relocation, and multithreaded architectures have tried to address memory latency. These techniques are not completely successful, since they require either extra hardware or software in the system or special properties in the applications. The software that prefetching and data relocation strategies need in order to improve cache performance itself pollutes the cache, so the technique becomes counterproductive. The extra hardware complexity of multithreaded architectures, on the other hand, slows the CPU clock, since “Simpler is Faster”. This dissertation seeks the cause of the poor locality behavior of allocation-intensive applications by studying allocators and their impact on the cache performance of these applications. Our study concludes that service functions in general, and memory management functions in particular, become entangled with the application's code and are the major cause of cache pollution. We present a novel technique, the Intelligent Memory Manager, that transfers the allocation and de-allocation functions entirely to a separate processor residing on the same chip as the DRAM. Our empirical results show that, on average, 60% of the cache misses caused by allocation and de-allocation service functions are eliminated using our technique. We also show that internal fragmentation (extra memory over-allocated by the allocator) works against the spatial locality of applications. We introduce “hybrid,” an exact-fit allocator, which reduces cache misses by 25% by minimizing internal fragmentation. Moreover, this work indicates that external fragmentation (the inability to use existing free space) indirectly affects execution performance. We propose address-ordered and segregated binary tree allocators that exhibit high storage utilization and moderate execution performance compared with existing allocators.
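
To make the notion of internal fragmentation concrete, the following minimal sketch compares an allocator that rounds every request up to a power-of-two block with one that returns exactly the requested size. The power-of-two policy, the request sizes, and the function names are illustrative assumptions; none of them are taken from the dissertation, and the sketch does not reproduce the “hybrid” allocator.

```python
def round_up_pow2(n: int) -> int:
    """Smallest power of two >= n, for n >= 1 (assumed size-class policy)."""
    return 1 << (n - 1).bit_length()


def internal_fragmentation(requests, block_size) -> float:
    """Fraction of allocated bytes that the application never asked for."""
    requested = sum(requests)
    allocated = sum(block_size(r) for r in requests)
    return (allocated - requested) / allocated


# Hypothetical object sizes in bytes, chosen only for illustration.
requests = [24, 40, 72, 136]

pow2_frag = internal_fragmentation(requests, round_up_pow2)
exact_frag = internal_fragmentation(requests, lambda r: r)

print(f"power-of-two blocks: {pow2_frag:.1%} of allocated memory unused")
print(f"exact fit:           {exact_frag:.1%} of allocated memory unused")
```

Under these assumed sizes, the padded blocks spread the same objects over more memory, so each cache line holds fewer useful bytes. This is the kind of spatial-locality loss that the abstract attributes to internal fragmentation.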
