Work stealing for interactive services to meet target latency

Interactive web services increasingly drive critical business workloads such as search, advertising, games, shopping, and finance. Whereas the optimization of parallel programs and distributed server systems has historically focused on average latency and throughput, the primary metric for interactive applications is consistent responsiveness, i.e., minimizing the number of requests that miss a target latency. This paper is the first to show how to generalize work stealing, which is traditionally used to minimize the makespan of a single parallel job, to optimize for a target latency in interactive services with multiple parallel requests. We design a new adaptive work-stealing policy, called tail-control, that reduces the number of requests that miss a target latency. It uses instantaneous request progress, system load, and a target latency to choose when to parallelize requests with stealing, when to admit new requests, and when to limit the parallelism of large requests. We implement this approach in the Intel Threading Building Blocks (TBB) library and evaluate it on real-world and synthetic workloads. The tail-control policy substantially reduces the number of requests that exceed the target latency, delivering up to 58% relative improvement over various baseline policies. This generalization of work stealing to multiple requests effectively optimizes the number of requests that complete within a target latency, a key metric for interactive services.
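To make the three decisions named above more concrete (steal, admit, or limit the parallelism of large requests), the following is a minimal illustrative sketch. It is not the tail-control algorithm or its TBB implementation: the `Request` fields, the `large_threshold_ms` rule, and the steal-before-admit ordering are all hypothetical assumptions chosen only to show how request progress, system load, and a target latency could feed one scheduling decision for an idle worker.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    STEAL = "steal"   # parallelize a running request by stealing its work
    ADMIT = "admit"   # start a queued request instead
    IDLE = "idle"     # nothing useful to do


@dataclass
class Request:
    elapsed_ms: float  # processing time consumed so far (request progress)
    stealable: bool    # whether the request currently exposes parallel tasks


def large_threshold_ms(active_requests: int, target_ms: float, cores: int) -> float:
    """Hypothetical rule: under higher load, a request counts as 'large' sooner."""
    load = min(1.0, active_requests / cores)
    return target_ms * (1.0 - 0.5 * load)


def choose_action(running: list, queued: int, target_ms: float, cores: int) -> Action:
    """Decide what an idle worker does next (assumed policy, for illustration)."""
    threshold = large_threshold_ms(len(running), target_ms, cores)
    # Requests past the threshold are treated as large: no further stealing,
    # which caps their parallelism so they cannot starve other requests.
    candidates = [r for r in running if r.stealable and r.elapsed_ms < threshold]
    if candidates:
        return Action.STEAL
    if queued > 0:
        return Action.ADMIT
    return Action.IDLE
```

In this sketch, a worker on a 4-core machine with one running request and a 100 ms target uses a threshold of 87.5 ms: a request that has run for 10 ms is still eligible for stealing, while one that has run for 95 ms is treated as large, so the worker admits a queued request instead.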
