Chopping off the Tail: Bounded Non-Determinism for Real-Time Accelerators

Modern data centers run web-scale applications across tens of thousands of servers, and each incoming user request fans out into tens of thousands of Remote Procedure Calls (RPCs) to backend services. Tail latency, caused by a small fraction of randomly slow RPCs, hurts end-to-end request performance, degrades users’ quality of experience, and limits disaggregation (applications’ ability to scale across a data center). We argue that current approaches to improving tail latency (especially those that bound computation time) are insufficient, even with (reconfigurable) hardware accelerators. Instead, to chop off the tail, datacenter services should dynamically trade correctness (or result quality) for timeliness, providing bounded latency with near-ideal accuracy. In this paper, we discuss how the increasing prevalence of machine learning (including search techniques like approximate nearest neighbor and PageRank), perceptual algorithms (like computational photography and image/video caching), and natural language processing lets modern hardware accelerators make these dynamic correctness tradeoffs while improving users’ quality of experience.
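The correctness-for-timeliness tradeoff can be sketched as an anytime computation: work improves the answer monotonically, and a hard deadline cuts it off, so latency is bounded while quality degrades gracefully. The following minimal sketch (a hypothetical illustration, not the paper's mechanism) shows this for nearest-neighbor search, where a linear scan returns the best candidate found so far when the time budget expires:

```python
import time

def anytime_nearest(query, points, deadline_s):
    """Scan candidates until a deadline; return the best-so-far match.

    Illustrative sketch of bounded latency with approximate results:
    an answer is always returned on time, and its quality improves
    monotonically with the time budget. All names here are hypothetical.
    """
    deadline = time.monotonic() + deadline_s
    best, best_dist = None, float("inf")
    for p in points:
        # Squared Euclidean distance; cheap monotone quality metric.
        d = sum((a - b) ** 2 for a, b in zip(query, p))
        if d < best_dist:
            best, best_dist = p, d
        if time.monotonic() >= deadline:
            break  # Deadline hit: return the approximate best-so-far.
    return best, best_dist
```

With a generous budget the scan completes and the result is exact; as the budget shrinks, the same call returns on time with a progressively rougher answer, which is the shape of tradeoff the paper advocates.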
