Paper Information - Enhancing Server Efficiency in the Face of Killer Microseconds

Enhancing Server Efficiency in the Face of Killer Microseconds

We are entering an era of “killer microseconds” in data center applications. Killer microseconds refer to μs-scale “holes” in CPU schedules caused by stalls to access fast I/O devices or brief idle times between requests in high-throughput microservices. Whereas modern computing platforms can efficiently hide ns-scale and ms-scale stalls through micro-architectural techniques and OS context switching, they lack efficient support to hide the latency of μs-scale stalls. Simultaneous Multithreading (SMT) is an efficient way to improve core utilization and increase server performance density. Unfortunately, scaling SMT to provision enough threads to hide frequent μs-scale stalls is prohibitive, and SMT co-location can often drastically increase the tail latency of cloud microservices. In this paper, we propose Duplexity, a heterogeneous server architecture that employs aggressive multithreading to hide the latency of killer microseconds without sacrificing the Quality-of-Service (QoS) of latency-sensitive microservices. Duplexity provisions dyads (pairs) of two kinds of cores: master-cores, each of which primarily executes a single latency-critical master-thread, and lender-cores, which multiplex latency-insensitive throughput threads. When the master-thread stalls, the master-core borrows filler-threads from the lender-core, filling μs-scale utilization holes of the microservice. We propose critical mechanisms, including separate memory paths for the master-thread and filler-threads, to enable master-cores to borrow filler-threads while protecting master-threads’ state from disruption. Duplexity facilitates fast master-thread restart when stalls resolve and minimizes the microservice’s QoS violations. Our evaluation demonstrates that, on average, Duplexity achieves 1.9× higher core utilization and 2.7× lower iso-throughput 99th-percentile tail latency than an SMT-based server design.
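
The borrowing idea in the abstract can be illustrated with a toy utilization calculation. This is only a sketch under stated assumptions, not the paper's evaluation model: the names `Phase`, `utilization`, and `switch_us` are hypothetical, the per-stall switch cost is an assumed parameter, and the numbers in the usage lines are made up for illustration rather than taken from the paper's results.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """One burst of master-thread work followed by a microsecond-scale stall."""
    compute_us: float  # time the master-thread executes on the master-core
    stall_us: float    # stall that follows (fast I/O wait or idle gap)


def utilization(phases, borrow_fillers, switch_us=0.0):
    """Return the fraction of wall-clock time the core does useful work.

    borrow_fillers: if True, filler-threads run during each stall.
    switch_us: assumed (hypothetical) cost to switch to fillers and
               again to switch back, charged twice per stall.
    """
    busy = 0.0
    total = 0.0
    for p in phases:
        busy += p.compute_us
        total += p.compute_us + p.stall_us
        if borrow_fillers:
            # Fillers use whatever is left of the stall after both switches.
            busy += max(0.0, p.stall_us - 2 * switch_us)
    return busy / total if total else 0.0


# Illustrative numbers only.
phases = [Phase(compute_us=2.0, stall_us=8.0), Phase(compute_us=5.0, stall_us=5.0)]
baseline = utilization(phases, borrow_fillers=False)
with_fillers = utilization(phases, borrow_fillers=True, switch_us=0.5)
print(f"baseline={baseline:.2f} with_fillers={with_fillers:.2f}")
```

The sketch deliberately ignores everything that makes the real problem hard (cache and memory-path interference, restart latency of the master-thread, and tail-latency effects); it only shows why filling stalls raises utilization and why the benefit shrinks as the assumed switch cost approaches the stall length.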
