论文信息 - ABC-DIMM: Alleviating the Bottleneck of Communication in DIMM-based Near-Memory Processing with Inter-DIMM Broadcast

ABC-DIMM: Alleviating the Bottleneck of Communication in DIMM-based Near-Memory Processing with Inter-DIMM Broadcast

Near-Memory Processing (NMP) systems that integrate accelerators within DIMM (Dual-Inline Memory Module) buffer chips potentially provide high performance with relatively low design and manufacturing costs. However, an inevitable communication bottleneck arises when considering the main memory bus among peer DIMMs and the host CPU. This communication bottleneck roots in the bus-based nature and the limited point-to-point communication pattern of the main memory system. The aggregated memory bandwidth of DIMM- based NMP scales with the number of DIMMs. When the number of DIMMs in a channel scales up, the per-DIMM point-to-point communication bandwidth scales down, whereas the computation resources and local memory bandwidth per DIMM stay the same. For many important sparse data-intensive workloads like graph applications and sparse tensor algebra, we identify that communication among DIMMs and the host CPU easily dominates their processing procedure in previous DIMM-based NMP systems, which severely bottlenecks their performance.To tackle this challenge, we propose that inter-DIMM broadcast should be implemented and utilized in the main memory system of DIMM-based NMP. On the hardware side, the main memory bus naturally scales out with broadcast, where per- DIMM effective bandwidth of broadcast remains the same as the number of DIMMs grows. On the software side, many sparse applications can be implemented in a form such that broadcasts dominate their communication. Based on these ideas, we design ABC-DIMM, which Alleviates the Bottleneck of Communication in DIMM-based NMP, consisting of integral broadcast mechanisms and Broadcast-Process programming framework, with minimized modifications to commodity software-hardware stack. Our evaluation shows that ABC-DIMM offers an 8.33 × geo-mean speedup over a 16-core CPU baseline, and outperforms two NMP baselines by 2.59 × and 2.93 × on average.

[1] Keshav Pingali,et al. Parallel graph analytics , 2016, Commun. ACM.

[2] Kai Li,et al. Retrospective: virtual memory mapped network interface for the SHRIMP multicomputer , 1994, ISCA '98.

[3] Sanjay Ghemawat,et al. MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[4] Ramyad Hadidi,et al. GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[5] Christoforos E. Kozyrakis,et al. GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[6] David A. Wood,et al. A Primer on Memory Consistency and Cache Coherence, Second Edition , 2020, A Primer on Memory Consistency and Cache Coherence.

[7] Minsoo Rhu,et al. TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning , 2019, MICRO.

[8] Christoforos E. Kozyrakis,et al. A case for intelligent RAM , 1997, IEEE Micro.

[9] Margaret Martonosi,et al. Graphicionado: A high-performance and energy-efficient accelerator for graph analytics , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[10] Dean M. Tullsen,et al. Interconnections in multi-core architectures: understanding mechanisms, overheads and scaling , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[11] Michael F. Deering,et al. FBRAM: a new form of memory optimized for 3D graphics , 1994, SIGGRAPH.

[12] Sparsh Mittal. A Survey of Techniques for Cache Locking , 2016, ACM Trans. Design Autom. Electr. Syst..

[13] Hyuk-Jae Lee,et al. An Effective DRAM Address Remapping for Mitigating Rowhammer Errors , 2019, IEEE Transactions on Computers.

[14] Sally A. McKee,et al. Hitting the memory wall: implications of the obvious , 1995, CARN.

[15] David A. Patterson,et al. Computer Architecture, Fifth Edition: A Quantitative Approach , 2011 .

[16] William J. Dally,et al. Smart Memories: a modular reconfigurable architecture , 2000, ISCA '00.

[17] Milan M. Jovanovic,et al. An overview of reflective memory systems , 1999, IEEE Concurr..

[18] Mingyu Gao,et al. HRL: Efficient and flexible reconfigurable logic for near-data processing , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[19] Kiyoung Choi,et al. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[20] Mike Ignatowski,et al. TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.

[21] Richard B. Gillett. Memory Channel Network for PCI , 1996, IEEE Micro.

[22] Yanzhi Wang,et al. GraphQ: Scalable PIM-Based Graph Processing , 2019, MICRO.

[23] Jung Ho Ahn,et al. Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[24] Timothy A. Davis,et al. The university of Florida sparse matrix collection , 2011, TOMS.

[25] Kunle Olukotun,et al. Simplifying Scalable Graph Processing with a Domain-Specific Language , 2014, CGO '14.

[26] Rajeev Balasubramonian,et al. Innovations in the Memory System , 2019, Synthesis Lectures on Computer Architecture.

[27] Feifei Li,et al. Comparing Implementations of Near-Data Computing with In-Memory MapReduce Workloads , 2014, IEEE Micro.

[28] Valeria Bertacco,et al. GraphVine: Exploiting Multicast for Scalable Graph Analytics , 2020, 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[29] William J. Dally,et al. Scaling the Power Wall: A Path to Exascale , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[30] Harish Patil,et al. Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[31] Aart J. C. Bik,et al. Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.

[32] Onur Mutlu,et al. Ramulator: A Fast and Extensible DRAM Simulator , 2016, IEEE Computer Architecture Letters.

[33] Onur Mutlu,et al. A case for exploiting subarray-level parallelism (SALP) in DRAM , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[34] Dong Li,et al. DESTINY: A tool for modeling emerging 3D NVM and eDRAM caches , 2015, 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[35] Kiyoung Choi,et al. A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[36] Christoforos E. Kozyrakis,et al. ZSim: fast and accurate microarchitectural simulation of thousand-core systems , 2013, ISCA.

[37] Babak Falsafi,et al. The mondrian data engine , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[38] Shashank Shekhar,et al. Opportunity for compute partitioning in pursuit of energy-efficient systems , 2016, LCTES.

[39] Frederic T. Chong,et al. Active pages: a computation model for intelligent memory , 1998, ISCA.

[40] Jinjun Xiong,et al. Application-Transparent Near-Memory Processing Architecture with Memory Channel Network , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[41] Jung Ho Ahn,et al. NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[42] Mattan Erez,et al. Near Data Acceleration with Concurrent Host Access , 2020, 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA).

[43] No License,et al. Intel ® 64 and IA-32 Architectures Software Developer ’ s Manual Volume 3 A : System Programming Guide , Part 1 , 2006 .

[44] Yuan Xie,et al. MEDAL: Scalable DIMM based Near Data Processing Accelerator for DNA Seeding Algorithm , 2019, MICRO.