Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency

Interactive services often have large-scale parallel implementations. To deliver fast responses, the median and tail latencies of a service's components must be low. In this paper, we explore the hardware, OS, and application-level sources of poor tail latency in high-throughput servers executing on multi-core machines. We model these network services as a queuing system in order to establish the best-achievable latency distribution. Using fine-grained measurements of three different servers (a null RPC service, Memcached, and Nginx) on Linux, we then explore why these servers exhibit significantly worse tail latencies than queuing models alone predict. The underlying causes include interference from background processes, request re-ordering caused by poor scheduling or constrained concurrency models, suboptimal interrupt routing, CPU power-saving mechanisms, and NUMA effects. We systematically eliminate these factors and show that Memcached can achieve a median latency of 11 μs and a 99.9th percentile latency of 32 μs at 80% utilization on a four-core system. In comparison, a naïve deployment of Memcached at the same utilization on a single-core system has a median latency of 100 μs and a 99.9th percentile latency of 5 ms. Finally, we demonstrate that tradeoffs exist between throughput, energy, and tail latency.
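To illustrate how a queuing model yields a baseline latency distribution, the following is a minimal sketch that assumes the simplest textbook case: a single-server M/M/1 FIFO queue with Poisson arrivals and exponentially distributed service times. The abstract does not state which queuing model the paper uses, so the model choice, the function name `mm1_latency_percentile`, and the 1 μs service time are illustrative assumptions rather than the authors' method. In an M/M/1 queue the response time (waiting plus service) is exponentially distributed with mean S / (1 - ρ), where S is the mean service time and ρ is the utilization, so any percentile has a closed form.

```python
import math


def mm1_latency_percentile(service_time_us, utilization, percentile):
    """Response-time percentile of an M/M/1 FIFO queue (illustrative assumption).

    service_time_us: mean service time S, in microseconds.
    utilization:     offered load rho, with 0 <= rho < 1.
    percentile:      target quantile as a fraction, e.g. 0.999 for the 99.9th.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    if not 0.0 < percentile < 1.0:
        raise ValueError("percentile must be in (0, 1)")
    # Response time is exponential with mean S / (1 - rho).
    mean_response_us = service_time_us / (1.0 - utilization)
    # Invert the exponential CDF: P(T <= t) = 1 - exp(-t / mean).
    return -mean_response_us * math.log(1.0 - percentile)


if __name__ == "__main__":
    # Hypothetical 1 us mean service time at 80% utilization.
    for p in (0.5, 0.99, 0.999):
        latency = mm1_latency_percentile(1.0, 0.8, p)
        print(f"p{p * 100:g}: {latency:.2f} us")
```

Under these assumptions, the 99.9th percentile at 80% utilization is about ten times the median, even with no interference at all. A gap much larger than what such a model predicts is the kind of discrepancy the abstract attributes to hardware, OS, and application-level effects.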
