Improving Tail Latency of Stateful Cloud Services via GC Control and Load Shedding

Most modern cloud web services run on top of runtime environments such as .NET's Common Language Runtime or the Java Runtime Environment. On the one hand, runtime environments provide several off-the-shelf benefits, such as code security and cross-platform execution. On the other hand, runtime features such as just-in-time compilation and automatic memory management add non-deterministic overhead to the overall service time, lengthening the tail of the latency distribution. In this context, the garbage collector (GC) is among the leading causes of high tail latency. To tackle this problem, we developed the Garbage Collector Control Interceptor (GCI), a request interceptor algorithm that is agnostic to the cloud service's language, internals, and incoming load. GCI is fully decentralized and improves the tail latency of cloud services by making sure that service instances shed incoming load while cleaning up the runtime heap. We evaluated GCI's effectiveness on a stateful service prototype, varying the number of available instances. Our results show that GCI eliminates the impact of garbage collection on service latency for both small (4 nodes) and large (64 nodes) deployments, with no throughput loss.

Thiago Emmanuel Pereira | Daniel Fireman | João Brunet | David Quaresma | Raquel Lopes
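
The abstract describes GCI only at a high level: an interceptor in front of a service instance that sheds incoming load while the runtime heap is being cleaned. The following is a minimal Python sketch of that general idea, not the authors' implementation. The class name `GCInterceptorSketch`, the request-count trigger `requests_per_gc`, and the `(status, body)` return shape are all hypothetical, and Python's `gc.collect` merely stands in for whatever collector the service runtime uses.

```python
import gc
import threading


class GCInterceptorSketch:
    """Hypothetical interceptor: sheds requests while a collection is pending or running."""

    def __init__(self, handler, requests_per_gc=100, collect=gc.collect):
        self._handler = handler
        # Assumption: trigger a collection after a fixed number of requests.
        # The abstract does not say what trigger the real GCI uses.
        self._requests_per_gc = requests_per_gc
        self._collect = collect
        self._lock = threading.Lock()
        self._in_flight = 0
        self._since_gc = 0
        self._shedding = False

    def handle(self, request):
        with self._lock:
            if self._shedding:
                # Shed load: refuse immediately so the caller can retry elsewhere.
                return 503, "shedding load during garbage collection"
            self._in_flight += 1
            self._since_gc += 1
        try:
            body = self._handler(request)
        finally:
            self._finish()
        return 200, body

    def _finish(self):
        with self._lock:
            self._in_flight -= 1
            if self._since_gc >= self._requests_per_gc:
                self._shedding = True  # stop admitting new requests
            # Only the last in-flight request can observe this condition,
            # because no new request is admitted while shedding.
            last_one_out = self._shedding and self._in_flight == 0
        if last_one_out:
            # Simplification: the collection runs on this request's thread and
            # delays its response; a real interceptor would run it off the request path.
            self._collect()
            with self._lock:
                self._since_gc = 0
                self._shedding = False


if __name__ == "__main__":
    interceptor = GCInterceptorSketch(lambda req: req.upper(), requests_per_gc=2)
    for item in ["a", "b", "c"]:
        print(interceptor.handle(item))
```

In this sketch each instance decides on its own when to shed, with no coordination between instances, which mirrors the decentralized design the abstract mentions. A refused request is expected to be retried against another instance by whatever sits in front of the service; that routing layer is outside the sketch.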
