Compiler Support for Optimizing Memory Bank-Level Parallelism

Many prior compiler-based optimization schemes focused exclusively on cache data locality. However, cache locality is only one part of the overall performance of applications running on emerging multicores or many cores. For example, memory stalls could constitute a very large fraction of execution time even in cache-optimized codes, and one of the main reasons for this is lack of memory-level parallelism. Motivated by this, we propose a compiler-based Bank-Level Parallelism (BLP) optimization scheme that uses loop tile scheduling. More specifically, we first use Cache Miss Equations to predict where the last-level cache miss will happen in each tile, and then identify the set of memory banks that will be accessed in each tile. Using this information, two tile scheduling algorithms are proposed to maximize BLP, each targeting a different scenario. We further discuss how our compiler-based scheme can be enhanced to consider memory controller-level parallelism and row-buffer locality. Our experimental evaluation using 11 multithreaded applications shows that the proposed BLP optimization can improve average BLP by 17.1% on average, resulting in a 9.2% reduction in average memory access latency. Furthermore, considering memory controller-level parallelism and row-buffer locality (in addition to BLP) takes our average improvement in memory access latency to 22.2%.

[1]  Jongmoo Choi,et al.  Regularities considered harmful: forcing randomness to memory accesses to reduce row buffer conflicts for multi-core, multi-bank systems , 2013, ASPLOS '13.

[2]  Monica S. Lam,et al.  A Loop Transformation Theory and an Algorithm to Maximize Parallelism , 1991, IEEE Trans. Parallel Distributed Syst..

[3]  Sai Prashanth Muralidhara,et al.  Reducing memory interference in multicore systems via application-aware memory channel partitioning , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[4]  Mahmut T. Kandemir,et al.  Reshaping cache misses to improve row-buffer locality in multicore systems , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[5]  Zhao Zhang,et al.  A performance comparison of DRAM memory system optimizations for SMT processors , 2005, 11th International Symposium on High-Performance Computer Architecture.

[6]  Onur Mutlu,et al.  Self-Optimizing Memory Controllers: A Reinforcement Learning Approach , 2008, 2008 International Symposium on Computer Architecture.

[7]  Monica S. Lam,et al.  Maximizing parallelism and minimizing synchronization with affine transforms , 1997, POPL '97.

[8]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[9]  Mor Harchol-Balter,et al.  Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[10]  David W. Nellans,et al.  Micro-pages: increasing DRAM efficiency with locality-aware data placement , 2010, ASPLOS XV.

[11]  Clifford Stein,et al.  Introduction to Algorithms, 2nd edition. , 2001 .

[12]  Uday Bondhugula,et al.  Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors , 2009, PPoPP '09.

[13]  Dam Sunwoo,et al.  Balancing DRAM locality and parallelism in shared memory CMP systems , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[14]  Xiaobing Feng,et al.  Software-Hardware Cooperative DRAM Bank Partitioning for Chip Multiprocessors , 2010, NPC.

[15]  James H. Anderson,et al.  Real-Time Scheduling on Multicore Platforms , 2006, 12th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS'06).

[16]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[17]  Marco Caccamo,et al.  Memory-centric scheduling for multicore hard real-time systems , 2012, Real-Time Systems.

[18]  François Irigoin,et al.  Supernode partitioning , 1988, POPL '88.

[19]  Mor Harchol-Balter,et al.  ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[20]  Christoforos E. Kozyrakis,et al.  Future scaling of processor-memory interfaces , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[21]  Mahmut T. Kandemir,et al.  On-chip cache hierarchy-aware tile scheduling for multicore machines , 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[22]  Scott Rixner,et al.  Memory Controller Optimizations for Web Servers , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[23]  Lei Liu,et al.  A software memory partition approach for eliminating bank-level interference in multicore systems , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[24]  Onur Mutlu,et al.  Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems , 2008, 2008 International Symposium on Computer Architecture.

[25]  Xu Cheng,et al.  Improving system throughput and fairness simultaneously in shared memory CMP systems via Dynamic Bank Partitioning , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[26]  Michael Wolfe,et al.  Iteration Space Tiling for Memory Hierarchies , 1987, PPSC.

[27]  Onur Mutlu,et al.  Improving memory Bank-Level Parallelism in the presence of prefetching , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[28]  Onur Mutlu,et al.  A Case for MLP-Aware Cache Replacement , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[29]  S. Phadke,et al.  MLP aware heterogeneous memory system , 2011, 2011 Design, Automation & Test in Europe.

[30]  Zhao Zhang,et al.  Mini-rank: Adaptive DRAM architecture for improving memory power efficiency , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[31]  Xing Zhou,et al.  Hierarchical overlapped tiling , 2012, CGO '12.

[32]  Christoforos E. Kozyrakis,et al.  Improving System Energy Efficiency with Memory Rank Subsetting , 2012, TACO.

[33]  Mahmut T. Kandemir,et al.  Cache topology aware computation mapping for multicores , 2010, PLDI '10.

[34]  Xin-She Yang,et al.  Introduction to Algorithms , 2021, Nature-Inspired Optimization Algorithms.

[35]  Mahmut T. Kandemir,et al.  Addressing End-to-End Memory Access Latency in NoC-Based Multicores , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[36]  Kevin Kai-Wei Chang,et al.  Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[37]  Sarita V. Adve,et al.  Code transformations to improve memory parallelism , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[38]  Niladrish Chatterjee,et al.  Micro-pages: increasing DRAM efficiency with locality-aware data placement , 2010, ASPLOS 2010.

[39]  Uday Bondhugula,et al.  Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[40]  Rudolf Eigenmann,et al.  SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance , 2001, WOMPAT.

[41]  Norman P. Jouppi,et al.  Rethinking DRAM design and organization for energy-constrained multi-cores , 2010, ISCA.

[42]  Zhao Zhang,et al.  A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality , 2000, MICRO 33.

[43]  Onur Mutlu,et al.  A case for exploiting subarray-level parallelism (SALP) in DRAM , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[44]  Onur Mutlu,et al.  DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems , 2010 .

[45]  Sharad Malik,et al.  Cache miss equations: an analytical representation of cache misses , 1997, ICS '97.

[46]  Yongbing Huang,et al.  A Study of Leveraging Memory Level Parallelism for DRAM System on Multi-core/Many-Core Architecture , 2013, 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications.