Concurrent Data Structures with Near-Data-Processing: an Architecture-Aware Implementation

Recent advances in memory architectures have provoked renewed interest in near-data-processing (NDP) as way to alleviate the "memory wall" problem. An NDP architecture places logic circuits, such as simple processors, in close proximity to memory. Effective use of NDP architectures requires rethinking data structures and their algorithms. Here, we provide an empirical evaluation of several NDP-aware algorithms for general-purpose concurrent data structures such as linked-lists, skiplists, and FIFO queues. The empirical analysis reveals that the potential benefits of NDP-based concurrent data structures are less than what had been expected in earlier studies. In turn, we introduce lightweight NDP hardware modifications, inspired by initial observations on data access patterns and underlying DRAM activity. Even the minimal changes to hardware significantly improve the performance and energy consumption of NDP-based concurrent data structures, and in many cases, the resulting data structures outperform state-of-the-art concurrent data structures.

[1]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[2]  Onur Mutlu,et al.  GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies , 2017, BMC Genomics.

[3]  Kees G. W. Goossens,et al.  Improved Power Modeling of DDR SDRAMs , 2011, 2011 14th Euromicro Conference on Digital System Design.

[4]  Sally A. McKee,et al.  Hitting the memory wall: implications of the obvious , 1995, CARN.

[5]  Scott A. Mahlke,et al.  In-Memory Data Parallel Processor , 2018, ASPLOS.

[6]  Christoforos E. Kozyrakis,et al.  GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[7]  Song Jiang,et al.  Wormhole: A Fast Ordered Index for In-memory Data Management , 2018 .

[8]  Henri-Pierre Charles,et al.  Micro-architectural simulation of embedded core heterogeneity with gem5 and McPAT , 2015, RAPIDO '15.

[9]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[10]  Maurice Herlihy,et al.  Concurrent Data Structures for Near-Memory Computing , 2017, SPAA.

[11]  Ki-Seok Chung,et al.  HMC-MAC: Processing-in Memory Architecture for Multiply-Accumulate Operations with Hybrid Memory Cube , 2018, IEEE Computer Architecture Letters.

[12]  Luigi Carro,et al.  Processing in 3D memories to speed up operations on complex data structures , 2018, 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[13]  Feifei Li,et al.  NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[14]  Rachata Ausavarungnirun,et al.  Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks , 2018, ASPLOS.

[15]  Kiyoung Choi,et al.  PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[16]  William Pugh,et al.  Skip Lists: A Probabilistic Alternative to Balanced Trees , 1989, WADS.

[17]  Onur Mutlu,et al.  Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation , 2016, 2016 IEEE 34th International Conference on Computer Design (ICCD).

[18]  Luca Benini,et al.  Design and Evaluation of a Processing-in-Memory Architecture for the Smart Memory Cube , 2016, ARCS.

[19]  Bruce Jacob,et al.  Memory Systems: Cache, DRAM, Disk , 2007 .

[20]  Hyesoon Kim,et al.  Instruction Offloading with HMC 2.0 Standard: A Case Study for Graph Traversals , 2015, MEMSYS.

[21]  Eric Ruppert,et al.  Lock-free linked lists and skip lists , 2004, PODC '04.

[22]  Ramyad Hadidi,et al.  GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[23]  Onur Mutlu,et al.  LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory , 2017, IEEE Computer Architecture Letters.

[24]  Christoforos E. Kozyrakis,et al.  Practical Near-Data Processing for In-Memory Analytics Frameworks , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[25]  Reetuparna Das,et al.  Exploring specialized near-memory processing for data intensive operations , 2016, 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[26]  C. Martin 2015 , 2015, Les 25 ans de l’OMC: Une rétrospective en photos.

[27]  Maged M. Michael,et al.  Simple, fast, and practical non-blocking and blocking concurrent queue algorithms , 1996, PODC '96.

[28]  Jung Ho Ahn,et al.  Accelerating linked-list traversal through near-data processing , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).

[29]  Kiyoung Choi,et al.  A scalable processing-in-memory accelerator for parallel graph processing , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[30]  Maurice Herlihy,et al.  A Lazy Concurrent List-Based Set Algorithm , 2007, Parallel Process. Lett..

[31]  Nir Shavit,et al.  Flat combining and the synchronization-parallelism tradeoff , 2010, SPAA '10.

[32]  Mike Ignatowski,et al.  TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.

[33]  Hemangee K. Kapoor,et al.  Towards Near Data Processing of Convolutional Neural Networks , 2018, 2018 31st International Conference on VLSI Design and 2018 17th International Conference on Embedded Systems (VLSID).

[34]  Nam Sung Kim,et al.  Fine-Grained Task Migration for Graph Algorithms Using Processing in Memory , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[35]  Tajana Simunic,et al.  GenPIM: Generalized processing in-memory to accelerate data intensive applications , 2018, 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[36]  Guangyu Sun,et al.  PM3: Power Modeling and Power Management for Processing-in-Memory , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).