Paper Information - SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures

Near-Data-Processing (NDP) architectures present a promising way to alleviate data movement costs and can provide significant performance and energy benefits to parallel applications. Typically, NDP architectures support several NDP units, each including multiple simple cores placed close to memory. To fully leverage the benefits of NDP and achieve high performance for parallel workloads, efficient synchronization among the NDP cores of a system is necessary. However, supporting synchronization in many NDP systems is challenging because they lack shared caches and hardware cache coherence support, which are commonly used for synchronization in multicore systems, and communication across different NDP units can be expensive. This paper comprehensively examines the synchronization problem in NDP systems, and proposes SynCron, an end-to-end synchronization solution for NDP systems. SynCron adds low-cost hardware support near memory for synchronization acceleration, and avoids the need for hardware cache coherence support. SynCron has three components: 1) a specialized cache memory structure to avoid memory accesses for synchronization and minimize latency overheads, 2) a hierarchical message-passing communication protocol to minimize expensive communication across NDP units of the system, and 3) a hardware-only overflow management scheme to avoid performance degradation when hardware resources for synchronization tracking are exceeded. We evaluate SynCron using a variety of parallel workloads, covering various contention scenarios. SynCron improves performance by 1.27× on average (up to 1.78×) under high-contention scenarios, and by 1.35× on average (up to 2.29×) under low-contention real applications, compared to state-of-the-art approaches. SynCron reduces system energy consumption by 2.08× on average (up to 4.25×).
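To give a rough intuition for why a hierarchical message-passing scheme can reduce communication across NDP units, the following is a minimal sketch that counts cross-unit messages for a single barrier. It is a toy model and not SynCron's actual protocol: the function names, the placement of the central coordinator in unit 0, and the assumption of one arrival and one release message per participant are all illustrative assumptions.

```python
def flat_cross_unit_messages(num_units: int, cores_per_unit: int) -> int:
    """Toy model: every core talks directly to one coordinator in unit 0.

    Each core outside unit 0 sends one arrival message and receives one
    release message, and both of those cross a unit boundary.
    """
    remote_cores = (num_units - 1) * cores_per_unit
    return 2 * remote_cores


def hierarchical_cross_unit_messages(num_units: int, cores_per_unit: int) -> int:
    """Toy model: cores talk only to a coordinator inside their own unit.

    Each local coordinator outside unit 0 sends one aggregated arrival
    message to a master coordinator in unit 0 and receives one release.
    cores_per_unit is unused here because messages inside a unit never
    cross a unit boundary; it is kept so both functions share a signature.
    """
    remote_units = num_units - 1
    return 2 * remote_units


if __name__ == "__main__":
    units, cores = 4, 16  # hypothetical system size
    print(flat_cross_unit_messages(units, cores))          # 96
    print(hierarchical_cross_unit_messages(units, cores))  # 6
```

Under these assumptions, the cross-unit message count grows with the total number of cores in the flat scheme but only with the number of units in the hierarchical one.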
