Using Multicore Reuse Distance to Study Coherence Directories

Researchers have proposed numerous techniques to improve the scalability of coherence directories. The effectiveness of these techniques depends not only on application behavior but also on the CPU's configuration, such as its core count and cache size. As CPUs continue to scale, it is essential to explore the directory's application and architecture dependencies. However, this is challenging given the slow speed of simulators. While it is common practice to simulate different applications, previous research on directory designs has explored only a few CPU configurations, and in most cases only one, which can lead to an incomplete and inaccurate view of the directory's behavior. This article proposes to use multicore reuse distance analysis to study coherence directories. We develop a framework to extract the directory access stream from parallel least recently used (LRU) stacks, enabling rapid analysis of the directory's accesses and contents across both core count and cache size scaling. A key part of our framework is the notion of relative reuse distance between sharers, which defines sharing in a capacity-dependent fashion and facilitates our analyses along the data cache size dimension. We implement our framework in a profiler and then apply it to gain insights into the impact of multicore CPU scaling on directory behavior. Our profiling results show that directory accesses decrease by 3.3× when scaling the data cache size from 16KB to 1MB, despite an increase in sharing-based directory accesses. We also show that increased sharing caused by data cache scaling allows the portion of on-chip memory occupied by the directory to be reduced by 43.3%, compared to a reduction of only 2.6% when scaling the number of cores. Furthermore, we show that certain directory entries exhibit high temporal reuse.
In addition to gaining insights, we also validate our profile-based results and find that they are within 2% to 10% of cache simulations on average, across different validation experiments. Finally, we conduct four case studies that illustrate our insights on existing directory techniques. In particular, we demonstrate our directory occupancy insights on a Cuckoo directory; we apply our sharing insights to provide bounds on the size of Scalable Coherence Directories (SCD) and Dual-Grain Directories (DGD); and we demonstrate our directory entry reuse insights on a multilevel directory design.
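As a rough illustration of the measurement underlying reuse distance analysis, the following Python sketch computes LRU stack distances for a single access stream and derives miss counts for several cache capacities from one pass over the trace. It is not the article's framework: the function names `reuse_distances` and `misses_by_capacity` are hypothetical, and the sketch assumes a fully associative LRU cache and ignores the per-core stacks, invalidations, and relative reuse distance between sharers that the article describes.

```python
from collections import OrderedDict


def reuse_distances(trace):
    """Return the LRU stack (reuse) distance of each access in `trace`.

    The distance is the number of distinct blocks touched since the previous
    access to the same block. First-time accesses get float("inf").
    Hypothetical helper for illustration only.
    """
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for block in trace:
        if block in stack:
            keys = list(stack)
            # Depth measured from the top of the stack (0 = most recent).
            distances.append(len(keys) - 1 - keys.index(block))
            stack.move_to_end(block)
        else:
            distances.append(float("inf"))
            stack[block] = None
    return distances


def misses_by_capacity(distances, capacities):
    """Count misses for a fully associative LRU cache of each capacity.

    An access misses when its reuse distance is at least the capacity
    (measured in blocks), so one profile covers every cache size.
    """
    return {c: sum(1 for d in distances if d >= c) for c in capacities}


trace = ["a", "b", "c", "a", "b", "b"]
dists = reuse_distances(trace)
miss_counts = misses_by_capacity(dists, [1, 2, 4])
```

The point of the sketch is that a single profiling pass yields miss counts for every capacity at once, which is the property that makes this style of analysis much faster than rerunning a simulator per configuration.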
