Jumanji: The Case for Dynamic NUCA in the Datacenter

The datacenter introduces new challenges for computer systems around tail latency and security. This paper argues that dynamic NUCA techniques are a better solution to these challenges than prior cache designs. We show that dynamic NUCA designs can meet tail-latency deadlines with much less cache space than prior work, and that they also provide a natural defense against cache attacks. Unfortunately, prior dynamic NUCAs have missed these opportunities because they focus exclusively on reducing data movement.We present Jumanji, a dynamic NUCA technique designed for tail latency and security. We show that prior last-level cache designs are vulnerable to new attacks and offer imperfect performance isolation. Jumanji solves these problems while significantly improving performance of co-running batch applications. Moreover, Jumanji only requires lightweight hardware and a few simple changes to system software, similar to prior D-NUCAs. At 20 cores, Jumanji improves batch weighted speedup by 14% on average, vs. just 2% for a non-NUCA design with weaker security, and is within 2% of an idealized design.

[1]  Ronak Singhal,et al.  Inside Intel® Core microarchitecture (Nehalem) , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[2]  Christoforos E. Kozyrakis,et al.  Towards energy proportionality for large-scale latency-critical workloads , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[3]  William J. Dally,et al.  GPUs and the Future of Parallel Computing , 2011, IEEE Micro.

[4]  David A. Wood,et al.  ASR: Adaptive Selective Replication for CMP Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[5]  Dan Page,et al.  Partitioned Cache Architecture as a Side-Channel Defence Mechanism , 2005, IACR Cryptology ePrint Archive.

[6]  Sangyeun Cho,et al.  SOS: A Software-Oriented Distributed Shared Cache Management Approach for Chip Multiprocessors , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.

[7]  Christoforos E. Kozyrakis,et al.  Heracles: Improving resource efficiency at scale , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[8]  Eric Rotenberg,et al.  Jigsaw: Scalable software-defined caches , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[9]  Daniel Sánchez,et al.  Scaling distributed cache hierarchies through computation and data co-scheduling , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[10]  Yan Solihin,et al.  A Framework for Providing Quality of Service in Chip Multi-Processors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[11]  Daniel Sánchez,et al.  Tailbench: a benchmark suite and evaluation methodology for latency-critical applications , 2016, 2016 IEEE International Symposium on Workload Characterization (IISWC).

[12]  Yingwei Luo,et al.  DCAPS: dynamic cache allocation with partial sharing , 2018, EuroSys.

[13]  Christina Delimitrou,et al.  Paragon: QoS-aware scheduling for heterogeneous datacenters , 2013, ASPLOS '13.

[14]  Luiz André Barroso,et al.  The tail at scale , 2013, CACM.

[15]  John Shalf,et al.  Exascale Computing Technology Challenges , 2010, VECPAR.

[16]  Cesar Pereida García,et al.  Port Contention for Fun and Profit , 2019, 2019 IEEE Symposium on Security and Privacy (SP).

[17]  Luiz André Barroso,et al.  The Case for Energy-Proportional Computing , 2007, Computer.

[18]  Christoforos E. Kozyrakis,et al.  Vantage: Scalable and efficient fine-grain cache partitioning , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[19]  Adi Shamir,et al.  Cache Attacks and Countermeasures: The Case of AES , 2006, CT-RSA.

[20]  Yang Li,et al.  dCat: dynamic cache management for efficient, performance-sensitive infrastructure-as-a-service , 2018, EuroSys.

[21]  Daniel Sánchez,et al.  Rubik: Fast analytical power management for latency-critical systems , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[22]  Christoforos E. Kozyrakis,et al.  ZSim: fast and accurate microarchitectural simulation of thousand-core systems , 2013, ISCA.

[23]  Daniel Sánchez,et al.  Whirlpool: Improving Dynamic Cache Management with Static Data Classification , 2016, ASPLOS.

[24]  Amir Roth,et al.  FIESTA: A Sample-Balanced Multi-Program Workload Methodology , 2009 .

[25]  Babak Falsafi,et al.  SMoTherSpectre: Exploiting Speculative Execution through Port Contention , 2019, CCS.

[26]  Michael Hamburg,et al.  Meltdown: Reading Kernel Memory from User Space , 2018, USENIX Security Symposium.

[27]  Michael Hamburg,et al.  Spectre Attacks: Exploiting Speculative Execution , 2018, 2019 IEEE Symposium on Security and Privacy (SP).

[28]  Ruby B. Lee,et al.  New cache designs for thwarting software cache-based side channel attacks , 2007, ISCA '07.

[29]  Rami G. Melhem,et al.  Practical PACE for embedded systems , 2004, EMSOFT '04.

[30]  Daniel Mossé,et al.  Octopus-Man: QoS-driven task management for heterogeneous multicores in warehouse-scale computers , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[31]  Valentin Puente,et al.  ESP-NUCA: A low-cost adaptive Non-Uniform Cache Architecture , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[32]  Omer Khan,et al.  IRONHIDE: A Secure Multicore that Efficiently Mitigates Microarchitecture State Attacks for Interactive Applications , 2019, 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[33]  Josep Torrellas,et al.  Speculative Taint Tracking (STT): A Comprehensive Protection for Speculatively Accessed Data , 2019, IEEE Micro.

[34]  Josep Torrellas,et al.  InvisiSpec: Making Speculative Execution Invisible in the Cache Hierarchy , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[35]  Daniel Sánchez,et al.  Jenga: Software-defined cache hierarchies , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[36]  Daniel Sánchez,et al.  Ubik: efficient cache sharing with strict qos for latency-critical workloads , 2014, ASPLOS.

[37]  Alan Jay Smith,et al.  Improving dynamic voltage scaling algorithms with PACE , 2001, SIGMETRICS '01.

[38]  Nick Knupffer Intel Corporation , 2018, The Grants Register 2019.

[39]  José González,et al.  Elastic cooperative caching: an autonomous dynamically adaptive memory hierarchy for chip multiprocessors , 2010, ISCA.

[40]  Boris Grot,et al.  Stretch: Balancing QoS and Throughput for Colocated Server Workloads on SMT Cores , 2019, 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[41]  Zeshan Chishti,et al.  Optimizing replication, communication, and capacity allocation in CMPs , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[42]  David Meisner,et al.  Stochastic Queuing Simulation for Data Center Workloads , 2010 .

[43]  Ruby B. Lee,et al.  Random Fill Cache Architecture , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[44]  Thomas F. Wenisch,et al.  PowerNap: eliminating server idle power , 2009, ASPLOS.

[45]  Thu D. Nguyen,et al.  Exploiting Heterogeneity for Tail Latency and Energy Efficiency , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[46]  Thomas F. Wenisch,et al.  Power management of online data-intensive services , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[47]  Mainak Chaudhuri PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[48]  NahrstedtKlara,et al.  Energy-efficient soft real-time CPU scheduling for mobile multimedia systems , 2003 .

[49]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[50]  Moinuddin K. Qureshi Adaptive Spill-Receive for robust high-performance caching in CMPs , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[51]  Tao Li,et al.  Optimizing virtual machine consolidation performance on NUMA server architecture for cloud workloads , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[52]  Hao Wu,et al.  Newcache: Secure Cache Architecture Thwarting Cache Side-Channel Attacks , 2016, IEEE Micro.

[53]  Lingjia Tang,et al.  Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers , 2013, ISCA.

[54]  Krste Asanovic,et al.  Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[55]  Taesoo Kim,et al.  STEALTHMEM: System-Level Protection Against Cache-Based Side Channel Attacks in the Cloud , 2012, USENIX Security Symposium.

[56]  R. Govindarajan,et al.  Probabilistic Shared Cache Management (PriSM) , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[57]  David Kaeli,et al.  Exploiting Bank Conflict-based Side-channel Timing Leakage of GPUs , 2019, ACM Trans. Archit. Code Optim..

[58]  Christina Delimitrou,et al.  Quasar: resource-efficient and QoS-aware cluster management , 2014, ASPLOS.

[59]  Christina Delimitrou,et al.  PARTIES: QoS-Aware Resource Partitioning for Multiple Interactive Services , 2019, ASPLOS.

[60]  Gernot Heiser,et al.  Last-Level Cache Side-Channel Attacks are Practical , 2015, 2015 IEEE Symposium on Security and Privacy.

[61]  Babak Falsafi,et al.  Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.

[62]  Mehmet Kayaalp,et al.  A high-resolution side-channel attack on last-level cache , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[63]  Moinuddin K. Qureshi CEASER: Mitigating Conflict-Based Cache Attacks via Encrypted-Address and Remapping , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[64]  Daniel Sánchez,et al.  Nexus: A New Approach to Replication in Distributed Shared Caches , 2017, 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[65]  Lizhong Chen,et al.  Futility Scaling: High-Associativity Cache Partitioning , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[66]  Gernot Heiser,et al.  CATalyst: Defeating last-level cache side channel attacks in cloud computing , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[67]  Stephan Krenn,et al.  Cache Games -- Bringing Access-Based Cache Attacks on AES to Practice , 2011, 2011 IEEE Symposium on Security and Privacy.

[68]  David A. Wood,et al.  Managing Wire Delay in Large Chip-Multiprocessor Caches , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[69]  Xiaosong Ma,et al.  KPart: A Hybrid Cache Partitioning-Sharing Technique for Commodity Multicores , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[70]  Moinuddin K. Qureshi New Attacks and Defense for Encrypted-Address Cache , 2019, 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).

[71]  Josep Torrellas,et al.  MicroScope: Enabling Microarchitectural Replay Attacks , 2019, 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).

[72]  Aamer Jaleel,et al.  Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[73]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[74]  Jichuan Chang,et al.  Cooperative Caching for Chip Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[75]  Klara Nahrstedt,et al.  Energy-efficient soft real-time CPU scheduling for mobile multimedia systems , 2003, SOSP '03.

[76]  Mehmet Kayaalp,et al.  RIC: Relaxed Inclusion Caches for mitigating LLC side-channel attacks , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[77]  Aamer Jaleel,et al.  High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[78]  Francisco J. Cazorla,et al.  FlexDCP: a QoS framework for CMP architectures , 2009, OPSR.

[79]  Daniel Sánchez,et al.  Talus: A simple way to remove cliffs in cache performance , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[80]  Rajeev Balasubramonian,et al.  Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[81]  Yunsi Fei,et al.  A novel cache bank timing attack , 2017, 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[82]  Paul M. Carpenter,et al.  Hipster: Hybrid Task Manager for Latency-Critical Cloud Workloads , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[83]  Irfan Ahmad,et al.  Cache Modeling and Optimization using Miniature Simulations , 2017, USENIX Annual Technical Conference.

[84]  Sangyeun Cho,et al.  Managing Distributed, Shared L2 Caches through OS-Level Page Allocation , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[85]  Ronald G. Dreslinski,et al.  Adrenaline: Pinpointing and reining in tail queries with quick voltage boosting , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[86]  Ying Ye,et al.  COLORIS: A dynamic cache partitioning system using page coloring , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[87]  Zhao Zhang,et al.  Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[88]  Kevin Skadron,et al.  Bubble-up: Increasing utilization in modern warehouse scale computers via sensible co-locations , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[89]  Vijay S. Pai,et al.  Imbalanced cache partitioning for balanced data-parallel programs , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[90]  Josep Torrellas,et al.  Secure hierarchy-aware cache replacement policy (SHARP): Defending against cache-based side channel attacks , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).