Paper information - µTune: Auto-Tuned Threading for OLDI Microservices


Modern On-Line Data Intensive (OLDI) applications have evolved from monolithic systems to instead comprise numerous, distributed microservices interacting via Remote Procedure Calls (RPCs). Microservices face sub-millisecond (sub-ms) RPC latency goals, much tighter than their monolithic counterparts that must meet ≥ 100 ms latency targets. Sub-ms-scale threading and concurrency design effects that were once insignificant for such monolithic services can now come to dominate in the sub-ms-scale microservice regime. We investigate how threading design critically impacts microservice tail latency by developing a taxonomy of threading models--a structured understanding of the implications of how microservices manage concurrency and interact with RPC interfaces under wide-ranging loads. We develop µTune, a system that has two features: (1) a novel framework that abstracts threading model implementation from application code, and (2) an automatic load adaptation system that curtails microservice tail latency by exploiting inherent latency trade-offs revealed in our taxonomy to transition among threading models. We study µTune in the context of four OLDI applications to demonstrate up to 1.9× tail latency improvement over static threading choices and state-of-the-art adaptation techniques.
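The load adaptation idea in the abstract, switching among threading models as load changes, can be illustrated with a minimal sketch. This is not µTune's implementation: the two model names, the queries-per-second thresholds, and the `LoadAdapter` class are all hypothetical, chosen only to show a controller that transitions between models with hysteresis so it does not flap near a threshold.

```python
from dataclasses import dataclass
from enum import Enum


class ThreadingModel(Enum):
    # Illustrative labels only; the abstract does not enumerate the taxonomy.
    POLL_INLINE = "poll-inline"
    BLOCK_DISPATCH = "block-dispatch"


@dataclass
class LoadAdapter:
    """Chooses a threading model from observed load, with hysteresis."""

    high_qps: float = 1000.0  # hypothetical threshold for switching up
    low_qps: float = 800.0    # hypothetical threshold for switching back
    current: ThreadingModel = ThreadingModel.POLL_INLINE

    def observe(self, qps: float) -> ThreadingModel:
        # Switch only when load crosses the threshold for the *other* model,
        # so loads between low_qps and high_qps keep the current choice.
        if self.current is ThreadingModel.POLL_INLINE and qps >= self.high_qps:
            self.current = ThreadingModel.BLOCK_DISPATCH
        elif self.current is ThreadingModel.BLOCK_DISPATCH and qps <= self.low_qps:
            self.current = ThreadingModel.POLL_INLINE
        return self.current


adapter = LoadAdapter()
trace = [100, 900, 1200, 900, 700]  # made-up load samples in queries/second
history = [adapter.observe(qps).value for qps in trace]
print(history)
```

In this sketch the application code would only call `observe` and read the returned model, which loosely mirrors the abstract's separation between application code and threading model choice; how a real system measures load or performs the transition is outside what the abstract states.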

Thomas F. Wenisch | Akshitha Sriraman
