MOCA: Memory Object Classification and Allocation in Heterogeneous Memory Systems

In the era of abundant-data computing, main memory's latency and power significantly impact overall system performance and power. Today's computing systems are typically composed of homogeneous memory modules, which are optimized to provide either low latency, high bandwidth, or low power. Such memory modules do not cater to a wide range of applications with diverse memory access behavior. Thus, heterogeneous memory systems, which include several memory modules with distinct performance and power characteristics, are becoming promising alternatives. In such a system, allocating applications to their best-fitting memory modules improves system performance and energy efficiency. However, such an approach still leaves the full potential of heterogeneous memory systems under-utilized because not only applications, but also the memory objects within an application, differ in their memory access behavior. This paper proposes a novel page allocation approach to utilize heterogeneous memory systems at the memory object level. We design a memory object classification and allocation framework (MOCA) to characterize memory objects and then allocate them to their best-fit memory module to improve performance and energy efficiency. We experiment with heterogeneous memory systems that are composed of a Reduced Latency DRAM (RLDRAM) for latency-sensitive objects, a 2.5D-stacked High Bandwidth Memory (HBM) for bandwidth-sensitive objects, and a Low Power DDR (LPDDR) for non-memory-intensive objects. The MOCA framework includes detailed application profiling, a classification mechanism, and an allocation policy to place memory objects. Compared to a system with homogeneous memory modules, we demonstrate that heterogeneous memory systems with MOCA improve memory system energy efficiency by up to 63%. Compared to a heterogeneous memory system with only application-level page allocation, MOCA improves memory performance by 26% and energy efficiency by 33% for multi-program workloads.
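As a rough illustration of the object-level mapping described above (latency-sensitive objects to RLDRAM, bandwidth-sensitive objects to HBM, non-memory-intensive objects to LPDDR), the following Python sketch classifies profiled memory objects. It is not the MOCA implementation: the abstract does not state which profiling metrics or thresholds are used, so the fields `mpki` and `stall_per_miss`, both threshold constants, and the example object names are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Module(Enum):
    RLDRAM = "RLDRAM"  # reduced-latency DRAM
    HBM = "HBM"        # high-bandwidth memory
    LPDDR = "LPDDR"    # low-power DDR


@dataclass
class ObjectProfile:
    name: str
    mpki: float            # assumed metric: last-level cache misses per kilo-instruction
    stall_per_miss: float  # assumed metric: average core stall cycles per miss


# Hypothetical thresholds; real values would come from profiling studies.
MPKI_THRESHOLD = 1.0
STALL_THRESHOLD = 20.0


def classify(obj: ObjectProfile) -> Module:
    """Map one profiled memory object to its assumed best-fit module."""
    if obj.mpki < MPKI_THRESHOLD:
        # Rarely misses in cache: treat as non-memory-intensive.
        return Module.LPDDR
    if obj.stall_per_miss >= STALL_THRESHOLD:
        # Each miss stalls the core for long: treat as latency-sensitive.
        return Module.RLDRAM
    # Many misses that overlap well: treat as bandwidth-sensitive.
    return Module.HBM


def allocate(profiles: list[ObjectProfile]) -> dict[str, Module]:
    """Return a placement decision for every profiled object."""
    return {p.name: classify(p) for p in profiles}


if __name__ == "__main__":
    profiles = [
        ObjectProfile("lookup_table", mpki=12.0, stall_per_miss=45.0),
        ObjectProfile("stream_buffer", mpki=30.0, stall_per_miss=5.0),
        ObjectProfile("config_block", mpki=0.2, stall_per_miss=50.0),
    ]
    for name, module in allocate(profiles).items():
        print(f"{name}: {module.value}")
```

In this sketch the memory-intensity check runs first, so an object that rarely misses in cache goes to LPDDR regardless of its per-miss stall time. A real allocation policy would also have to handle module capacity limits, which the sketch omits.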
