µTune: Auto-Tuned Threading for OLDI Microservices

Modern On-Line Data Intensive (OLDI) applications have evolved from monolithic systems into numerous distributed microservices that interact via Remote Procedure Calls (RPCs). Microservices face sub-millisecond (sub-ms) RPC latency goals, much tighter than those of their monolithic counterparts, which must meet ≥ 100 ms latency targets. Sub-ms-scale threading and concurrency design effects that were once insignificant for monolithic services can now dominate in the sub-ms-scale microservice regime. We investigate how threading design affects microservice tail latency by developing a taxonomy of threading models: a structured understanding of the implications of how microservices manage concurrency and interact with RPC interfaces under wide-ranging loads. We develop µTune, a system with two features: (1) a novel framework that abstracts threading model implementation from application code, and (2) an automatic load adaptation system that curtails microservice tail latency by exploiting the inherent latency trade-offs revealed by our taxonomy to transition among threading models. We study µTune in the context of four OLDI applications and demonstrate up to 1.9× tail latency improvement over static threading choices and state-of-the-art adaptation techniques.
