Hiding the Microsecond-Scale Latency of Storage-Class Memories with Duplexity

We are entering the “killer microsecond” era in data center applications [1]. Due to advances in processor, memory, storage, and networking technologies, events that stall execution increasingly fall in the microsecond-scale latency range [2, 3]. Storage-class memories, such as 3D XPoint, are a prime example: accesses to them stall execution for single-digit microseconds. Whereas contemporary computing systems are well equipped to hide nanosecond- and millisecond-scale stalls, they lack efficient support for microsecond-scale stalls [1]. Nanosecond-scale stalls are hidden by microarchitectural mechanisms, such as out-of-order (OoO) execution and deep memory hierarchies, but these mechanisms are insufficient for microsecond-scale stalls. Conversely, operating systems use context switching to hide millisecond-scale latencies, such as disk accesses. However, context switch overheads (5-20μs [4]) are of the same order of magnitude as the stalls themselves, so context switching is not a viable latency-hiding technique in the microsecond regime.

Simultaneous multithreading (SMT) has been proposed as a way to co-locate latency-critical and batch threads on the same core, so that the batch threads fill the utilization holes caused by brief I/O stalls or inter-request idle time [5, 6]. Already today, scale-out workloads deployed in data centers exhibit low CPU utilization due to a lack of memory-level parallelism and front-end inefficiencies, calling for more SMT threads even in the absence of μs-scale stalls [7]. As batch workloads also adopt technologies like storage-class memory or rack-scale disaggregation, they, too, will incur such stalls. Consequently, even more threads must be added to ensure that, at any time, enough unstalled threads are available to fill a core’s execution bandwidth; the two threads offered by Intel’s Hyper-Threading are not nearly enough (see the back-of-envelope sketch below). Unfortunately, scaling the SMT microarchitecture to many more threads is prohibitive due to high logic complexity, wire delay, limited register file (RF) capacity, and cache pressure and thrashing among threads. Moreover, as previous studies have shown [8], some SMT thread co-locations can have a catastrophic impact on the tail latency of latency-critical threads, especially at high load, due to contention for shared resources [9].

To avoid compromising the tail latency of critical threads through SMT interference, we instead design Duplexity [10], a server architecture that directly addresses the killer-microsecond challenge: it fills the μs-scale “holes” in threads’ execution schedules, which arise due to idleness and stalls, with useful execution, without impacting the tail latency of latency-critical threads. Duplexity is the first server architecture that aims to improve server utilization in the presence of μs-scale stalls without sacrificing the QoS and tail latency of microservices. Our evaluation, using the gem5 [11] and BigHouse [12] simulation frameworks, demonstrates that Duplexity improves core utilization by 4.8× and 1.9×, and iso-throughput 99th-percentile tail latency by 1.8× and 2.7×, on average, over a baseline OoO and an SMT-based server architecture, respectively.
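To make the scale of the problem concrete, the short Python sketch below applies the classic first-order latency-hiding argument: a core stays busy only if the compute performed by other threads covers each thread's stall time, and OS context switching pays off only if its overhead is smaller than the stall it hides. This sketch is illustrative and not taken from the paper; all parameter values are assumptions chosen to match the latency ranges discussed above.

```python
# Back-of-envelope model (illustrative, not from the paper): how many ready
# threads does a core need before microsecond-scale stalls are hidden by
# useful work from other threads, and when is an OS context switch worth it?
# All parameter values below are assumptions, not measurements.
import math


def threads_to_hide_stalls(compute_burst_us: float, stall_us: float) -> int:
    """First-order latency-hiding bound: a core stays busy if the compute of
    the other threads covers each stall, i.e. threads >= (compute + stall) / compute.
    Assumes free thread switches and evenly interleaved stalls."""
    return math.ceil((compute_burst_us + stall_us) / compute_burst_us)


def switching_pays_off(stall_us: float, context_switch_us: float) -> bool:
    """OS context switching hides a stall only if switching away and later
    switching back costs less than simply waiting out the stall."""
    return 2 * context_switch_us < stall_us


if __name__ == "__main__":
    # Hypothetical thread: 0.5 us of compute between 2 us stalls
    # (e.g., a storage-class-memory access in the single-digit-us range).
    print(threads_to_hide_stalls(compute_burst_us=0.5, stall_us=2.0))  # -> 5 threads
    # A 10 us context switch (mid-range of the 5-20 us overhead cited above)
    # cannot profitably hide a 2 us stall.
    print(switching_pays_off(stall_us=2.0, context_switch_us=10.0))    # -> False
```

Even under these optimistic assumptions, a single core needs roughly five ready threads to stay busy, well beyond two-way Hyper-Threading, and context switching only adds overhead at this timescale, which is why scaling conventional SMT or relying on the OS is not an attractive answer.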

[1] Junjie Wu et al. BigHouse: A simulation infrastructure for data center systems. IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), 2012.

[2] Babak Falsafi et al. Clearing the clouds: a study of emerging scale-out workloads on modern hardware. ASPLOS, 2012.

[3] Xi Yang et al. Elfen Scheduling: Fine-Grain Principled Borrowing from Latency-Critical Workloads Using Simultaneous Multithreading. USENIX Annual Technical Conference (ATC), 2016.

[4] Yale N. Patt et al. MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP. IEEE/ACM International Symposium on Microarchitecture (MICRO), 2012.

[5] Christoforos E. Kozyrakis et al. Heracles: Improving resource efficiency at scale. ACM/IEEE International Symposium on Computer Architecture (ISCA), 2015.

[6] David A. Patterson et al. Attack of the killer microseconds. Communications of the ACM, 2017.

[7] Thomas F. Wenisch et al. µTune: Auto-Tuned Threading for OLDI Microservices. OSDI, 2018.

[8] Lingjia Tang et al. SMiTe: Precise QoS Prediction on Real-System SMT Processors to Improve Utilization in Warehouse Scale Computers. IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014.

[9] Thomas F. Wenisch et al. The Queuing-First Approach for Tail Management of Interactive Services. IEEE Micro, 2019.

[10] Thomas F. Wenisch et al. Enhancing Server Efficiency in the Face of Killer Microseconds. IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019.

[11] Thomas F. Wenisch et al. Thermostat: Application-transparent Page Management for Two-tiered Main Memory. ASPLOS, 2017.