PUMA: Pseudo unified memory architecture for single-ISA heterogeneous multi-core systems

Single-ISA heterogeneous multi-core processors have advantages over cost-equivalent homogeneous ones, which integrate cores having the same instruction set architecture (ISA) but offer different performance and power characteristics. When these cores share the off-chip main memory, requests from different cores will interfere with each other, leading to low system performance and unfairness even starvation. Unfortunately, state-of-the-art memory scheduling and thread scheduling algorithms are ineffective at solving these problems. This paper proposes a fundamentally new memory architecture of pseudo unified memory (PUMA), which partitions the memory into regions according cores' different performance, each core mostly requests only one memory region seldom exceeding, reducing interfere among cores while retaining bank level parallelism for improving performance and fairness. We evaluate the design trade-offs involved in our PUMA and compare it against three state-of-the-art memory management methods. Our experimental results show that PUMA improves both system performance and fairness among cores while reducing memory power.

[1]  Onur Mutlu,et al.  Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems , 2008, 2008 International Symposium on Computer Architecture.

[2]  Trevor Mudge,et al.  Modern dram architectures , 2001 .

[3]  Mor Harchol-Balter,et al.  ATLAS : A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers , 2010 .

[4]  Alexandra Fedorova,et al.  Addressing shared resource contention in multicore processors via scheduling , 2010, ASPLOS XV.

[5]  Lizy Kurian John,et al.  Minimalist open-page: A DRAM page-mode scheduling policy for the many-core era , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[6]  Zhao Zhang,et al.  A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality , 2000, MICRO 33.

[7]  Onur Mutlu,et al.  Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems , 2007, USENIX Security Symposium.

[8]  Mor Harchol-Balter,et al.  Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[9]  David W. Nellans,et al.  Micro-pages: increasing DRAM efficiency with locality-aware data placement , 2010, ASPLOS XV.

[10]  Dam Sunwoo,et al.  Balancing DRAM locality and parallelism in shared memory CMP systems , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[11]  Trevor N. Mudge,et al.  High-Performance DRAMs in Workstation Environments , 2001, IEEE Trans. Computers.

[12]  Sai Prashanth Muralidhara,et al.  Reducing memory interference in multicore systems via application-aware memory channel partitioning , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[13]  James E. Smith,et al.  Fair Queuing Memory Systems , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[14]  O. Mutlu,et al.  Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems , 2010, ASPLOS XV.

[15]  Sangyeun Cho,et al.  Managing Distributed, Shared L2 Caches through OS-Level Page Allocation , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[16]  Norman P. Jouppi,et al.  Single-ISA heterogeneous multi-core architectures for multithreaded workload performance , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[17]  Tao Li,et al.  Informed Microarchitecture Design Space Exploration Using Workload Dynamics , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[18]  Tong Li,et al.  Using OS Observations to Improve Performance in Multicore Systems , 2008, IEEE Micro.

[19]  Manuel Prieto,et al.  Leveraging workload diversity through OS scheduling to maximize performance on single-ISA heterogeneous multicore systems , 2011, J. Parallel Distributed Comput..

[20]  Qingyuan Deng,et al.  MemScale: active low-power modes for main memory , 2011, ASPLOS XVI.

[21]  Kevin Kai-Wei Chang,et al.  Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[22]  Richard Crisp,et al.  Direct RAMbus technology: the new main memory standard , 1997, IEEE Micro.

[23]  Tajana Simunic,et al.  vGreen: a system for energy efficient computing in virtualized environments , 2009, ISLPED.

[24]  Onur Mutlu,et al.  Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).