ThrottleBot - Performance without Insight

Large-scale applications are increasingly built by composing sets of microservices. In this model, the functionality of a single application might be split across hundreds or thousands of microservices. Resource provisioning for these applications is complex, requiring administrators to understand both how each microservice functions and how the microservices in an application depend on one another. In this paper we present ThrottleBot, a system that automates the process of determining which resource, when allocated to which microservice, is likely to have the greatest impact on application performance. We demonstrate the efficacy of our approach by applying ThrottleBot to both synthetic and real-world applications. We believe that ThrottleBot, when combined with existing microservice orchestrators such as Kubernetes, enables push-button deployment of web-scale applications.
