CoNDA: Efficient Cache Coherence Support for Near-Data Accelerators

Specialized on-chip accelerators are widely used to improve the energy efficiency of computing systems. Recent advances in memory technology have enabled near-data accelerators (NDAs), which reside off-chip close to main memory and can yield greater benefits than on-chip accelerators. However, enforcing coherence with the rest of the system, which is already a major challenge for accelerators, becomes more difficult for NDAs. This is because (1) the cost of communication between NDAs and CPUs is high, and (2) NDA applications generate a large amount of off-chip data movement. As a result, as we show in this work, existing coherence mechanisms eliminate most of the benefits of NDAs. We extensively analyze these mechanisms, and observe that (1) the majority of off-chip coherence traffic is unnecessary, and (2) much of the off-chip traffic can be eliminated if a coherence mechanism has insight into the memory accesses performed by the NDA. Based on our observations, we propose CoNDA, a coherence mechanism that lets an NDA optimistically execute an NDA kernel, under the assumption that the NDA has all necessary coherence permissions. This optimistic execution allows CoNDA to gather information on the memory accesses performed by the NDA and by the rest of the system. CoNDA exploits this information to avoid performing unnecessary coherence requests, and thus significantly reduces data movement for coherence. We evaluate CoNDA using state-of-the-art graph processing and hybrid in-memory database workloads. Averaged across all of our workloads operating on modest data set sizes, CoNDA improves performance by 19.6% over the highest-performance prior coherence mechanism (66.0%/51.7% over a CPU-only/NDA-only system) and reduces memory system energy consumption by 18.0% over the most energy-efficient prior coherence mechanism (43.7% over CPU-only). CoNDA comes within 10.4% and 4.4% of the performance and energy of an ideal mechanism with no cost for coherence. The benefits of CoNDA increase with large data sets: CoNDA improves performance over the highest-performance prior coherence mechanism by 38.3% (8.4x/7.7x over CPU-only/NDA-only), and comes within 10.2% of an ideal no-cost coherence mechanism.
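The optimistic execution described above can be pictured with a small software model. The sketch below is illustrative only and is not the hardware design from the paper. It assumes that the NDA's stores are buffered until the kernel finishes, that the accesses gathered during execution are kept as plain address sets, and that a conflict means the CPU wrote an address the NDA read. The names `OptimisticRun`, `run_kernel`, `load`, and `store` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class OptimisticRun:
    """Hypothetical bookkeeping for one optimistically executed NDA kernel."""
    nda_reads: set = field(default_factory=set)     # addresses the NDA read
    nda_writes: dict = field(default_factory=dict)  # address -> buffered value
    cpu_writes: set = field(default_factory=set)    # addresses the CPU wrote meanwhile


def run_kernel(memory, kernel, cpu_writes_during_run):
    """Run `kernel` without requesting coherence permissions up front.

    Assumption: a conflict exists only if the CPU wrote an address that the
    NDA read. Returns (committed, conflicting_addresses).
    """
    run = OptimisticRun(cpu_writes=set(cpu_writes_during_run))

    def load(addr):
        run.nda_reads.add(addr)
        # The NDA sees its own buffered stores first, then main memory.
        return run.nda_writes.get(addr, memory.get(addr, 0))

    def store(addr, value):
        # Stores stay buffered until the kernel is known to be conflict-free.
        run.nda_writes[addr] = value

    kernel(load, store)

    conflicts = run.nda_reads & run.cpu_writes
    if conflicts:
        # Only these addresses need coherence action; the buffered stores are
        # dropped and the caller may re-execute the kernel.
        return False, conflicts

    memory.update(run.nda_writes)  # no conflict: commit the buffered stores
    return True, set()


def add_kernel(load, store):
    store(2, load(0) + load(1))


memory = {0: 1, 1: 2}
ok, conflicts = run_kernel(memory, add_kernel, cpu_writes_during_run={5})
retry_ok, retry_conflicts = run_kernel(memory, add_kernel, cpu_writes_during_run={1})
```

In the first call the CPU writes only address 5, which the kernel never reads, so nothing needs to be resolved and the result is committed. In the second call the CPU writes address 1, so only that address is reported as conflicting.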
