Improving Tail Latency of Stateful Cloud Services via GC Control and Load Shedding

Most modern cloud web services run on top of runtime environments such as .NET's Common Language Runtime or the Java Runtime Environment. On the one hand, runtime environments provide several off-the-shelf benefits, such as code security and cross-platform execution. On the other hand, runtime features such as just-in-time compilation and automatic memory management add non-deterministic overhead to the overall service time, lengthening the tail of the latency distribution. In this context, the Garbage Collector (GC) is among the leading causes of high tail latency. To tackle this problem, we developed the Garbage Collector Control Interceptor (GCI), a request interceptor algorithm that is agnostic to the cloud service's language, its internals, and its incoming load. GCI is fully decentralized and improves the tail latency of cloud services by making sure that service instances shed incoming load while they clean up the runtime heap. We evaluated GCI's effectiveness on a stateful service prototype, varying the number of available instances. Our results showed that using GCI eliminates the impact of garbage collection on service latency for both small (4 nodes) and large (64 nodes) deployments, with no throughput loss.
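
The core idea, shedding incoming load while an instance cleans up its heap, can be illustrated with a minimal Python sketch. This is not the GCI algorithm itself: the class name `SheddingInterceptor`, the request-count trigger `requests_per_gc`, and the HTTP-style status codes are assumptions made for illustration only, since the abstract does not specify how GCI decides when to collect or how it signals shed requests.

```python
import gc
import threading


class SheddingInterceptor:
    """Hypothetical interceptor that sheds requests while a collection runs."""

    def __init__(self, handler, requests_per_gc=100, collect=gc.collect):
        self._handler = handler
        # Assumed trigger: collect after a fixed number of served requests.
        self._requests_per_gc = requests_per_gc
        self._collect = collect
        self._lock = threading.Lock()
        self._collecting = False
        self._since_gc = 0

    def handle(self, request):
        with self._lock:
            if self._collecting:
                # Shed the request instead of queueing it behind the collection.
                return 503, "shed: instance is collecting garbage"
            self._since_gc += 1
            trigger = self._since_gc >= self._requests_per_gc
            if trigger:
                # From now on, concurrent requests are shed until we finish.
                self._collecting = True
                self._since_gc = 0
        response = (200, self._handler(request))
        if trigger:
            self._run_collection()
        return response

    def _run_collection(self):
        try:
            self._collect()
        finally:
            with self._lock:
                self._collecting = False
```

In this sketch, a request that arrives during a collection receives a 503-style response immediately, so a client or load balancer could retry it on another instance rather than wait out the pause. Each instance decides on its own when to collect and shed, which is consistent with the decentralized design described above.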
