SCD: A scalable coherence directory with flexible sharer set encoding

Large-scale CMPs with hundreds of cores require a directory-based protocol to maintain cache coherence. However, previously proposed coherence directories are hard to scale beyond tens of cores, requiring either excessive area or energy, complex hierarchical protocols, or inexact representations of sharer sets that increase coherence traffic and degrade performance. We present SCD, a scalable coherence directory that relies on efficient highly-associative caches (such as zcaches) to implement a single-level directory that scales to thousands of cores, tracks sharer sets exactly, and incurs negligible directory-induced invalidations. SCD scales because, unlike conventional directories, it uses a variable number of directory tags to represent sharer sets: lines with one or few sharers use a single tag, while widely shared lines use additional tags, so tags remain small as the system scales up. We show that, thanks to the efficient highly-associative array it relies on, SCD can be fully characterized using analytical models, and can be sized to guarantee a negligible number of evictions independently of the workload. We evaluate SCD using simulations of a 1024-core CMP. For the same level of coverage, we find that SCD is 13× more area-efficient than full-map sparse directories, and 2× more area-efficient and faster than hierarchical directories, while requiring a simpler protocol. Furthermore, we show that SCD's analytical models are accurate in practice.
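The variable-tag idea can be illustrated with a toy model that counts how many directory tags a sharer set occupies. This is only a sketch under stated assumptions, not the paper's actual encoding: the abstract does not specify the tag formats, so the function `tags_needed`, the limit of 3 sharer pointers per tag, the 32-core leaf groups, and the two-level layout (one root tag plus one leaf tag per group that contains a sharer) are all hypothetical choices made for illustration.

```python
def tags_needed(sharers, num_cores=1024, pointers_per_tag=3, leaf_bits=32):
    """Count directory tags used by one line in a toy variable-tag model.

    Assumptions (hypothetical, not taken from the abstract):
      - a single tag can hold up to `pointers_per_tag` explicit sharer IDs;
      - larger sets use one root tag plus one leaf tag for each group of
        `leaf_bits` consecutive cores that contains at least one sharer.
    """
    sharers = set(sharers)
    if any(not 0 <= s < num_cores for s in sharers):
        raise ValueError("sharer ID out of range")

    # Few sharers: a single tag with explicit pointers is enough.
    if len(sharers) <= pointers_per_tag:
        return 1

    # Many sharers: one root tag, plus one leaf tag per non-empty group.
    groups = {s // leaf_bits for s in sharers}
    return 1 + len(groups)


if __name__ == "__main__":
    print(tags_needed([5]))          # private line: one tag
    print(tags_needed([0, 1, 2, 3])) # four sharers in one group
    print(tags_needed(range(1024)))  # line shared by every core
```

In this model a line with one or few sharers costs a single small tag, and only widely shared lines pay for extra tags, which is the property the abstract credits for keeping tags small as the core count grows.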
