Workload characterization of interactive cloud services on big and small server platforms

Key-value stores (e.g., Memcached) and web servers (e.g., NGINX) are widely used by cloud providers. As interactive services, they have strict service-level objectives, with typical 99th-percentile tail latencies on the order of a few milliseconds. Unlike average latency, tail latency is more sensitive to changes in usage load and traffic patterns, system configurations, and resource availability. Understanding the sensitivity of tail latency to application and system factors is critical to efficiently design and manage systems for these latency-critical services. We present a comprehensive study of the impact a diverse set of application, hardware, and isolation configurations have on tail latency for two representative interactive services, Memcached and NGINX. Examined factors include input load, thread-level parallelism, request size, virtualization, and resource partitioning. We conduct this study on two server platforms with significant differences in terms of architecture and price points: an Intel Xeon and an ARM-based Cavium ThunderX server. Experimental results show that latency on both platforms is subject to changes of several orders of magnitude depending on application and system settings, with Cavium ThunderX being more sensitive to configuration parameters.

[1]  Christoforos E. Kozyrakis,et al.  Reconciling high server utilization and sub-millisecond quality-of-service , 2014, EuroSys '14.

[2]  Babak Falsafi,et al.  Clearing the clouds: a study of emerging scale-out workloads on modern hardware , 2012, ASPLOS XVII.

[3]  Daniel Sánchez,et al.  Ubik: efficient cache sharing with strict qos for latency-critical workloads , 2014, ASPLOS.

[4]  Christoforos E. Kozyrakis,et al.  Vantage: Scalable and efficient fine-grain cache partitioning , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[5]  Kushagra Vaid,et al.  Web search using mobile cores: quantifying and mitigating the price of efficiency , 2010, ISCA.

[6]  David A. Koufaty,et al.  Hyperthreading Technology in the Netburst Microarchitecture , 2003, IEEE Micro.

[7]  Lada A. Adamic,et al.  Zipf's law and the Internet , 2002, Glottometrics.

[8]  Daniel Sánchez,et al.  Rubik: Fast analytical power management for latency-critical systems , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[9]  Hao Che,et al.  Wimpy or brawny cores: A throughput perspective , 2013, J. Parallel Distributed Comput..

[10]  Luiz André Barroso,et al.  The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines , 2009, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines.

[11]  Beng Chin Ooi,et al.  A Performance Study of Big Data on Small Nodes , 2015, Proc. VLDB Endow..

[12]  Lingjia Tang,et al.  Bubble-flux: precise online QoS management for increased utilization in warehouse scale computers , 2013, ISCA.

[13]  Song Jiang,et al.  Workload analysis of a large-scale key-value store , 2012, SIGMETRICS '12.

[14]  Christoforos E. Kozyrakis,et al.  Heracles: Improving resource efficiency at scale , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[15]  Urs Hölzle,et al.  Brawny cores still beat wimpy cores, most of the time , 2010 .

[16]  Jialin Li,et al.  Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency , 2014, SoCC.

[17]  Lakshmish Ramaswamy,et al.  Cache Clouds: Cooperative Caching of Dynamic Documents in Edge Networks , 2005, 25th IEEE International Conference on Distributed Computing Systems (ICDCS'05).

[18]  Sriram Sankar,et al.  Server Engineering Insights for Large-Scale Online Services , 2010, IEEE Micro.

[19]  L. Geppert Sun's big splash [Niagara microprocessor chip] , 2005, IEEE Spectrum.

[20]  Michael Abd-El-Malek,et al.  Omega: flexible, scalable schedulers for large compute clusters , 2013, EuroSys '13.

[21]  Trevor N. Mudge,et al.  Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments , 2008, 2008 International Symposium on Computer Architecture.

[22]  Mattan Erez,et al.  Dirigent: Enforcing QoS for Latency-Critical Tasks on Shared Multicore Systems , 2016, ASPLOS.

[23]  Abhishek Verma,et al.  Large-scale cluster management at Google with Borg , 2015, EuroSys.

[24]  Ninghui Sun,et al.  DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.

[25]  René W. Schmidt,et al.  vApp: a standards-based container for cloud providers , 2010, OPSR.

[26]  Fang-Yie Leu,et al.  Cache Replacement Algorithms for YouTube , 2014, 2014 IEEE 28th International Conference on Advanced Information Networking and Applications.

[27]  James E. Smith,et al.  Virtual machines - versatile platforms for systems and processes , 2005 .

[28]  Adam Wierman,et al.  Open Versus Closed: A Cautionary Tale , 2006, NSDI.

[29]  Brad Fitzpatrick,et al.  Distributed caching with memcached , 2004 .

[30]  Lingjia Tang,et al.  Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[31]  Pradeep Dubey,et al.  Architecting to achieve a billion requests per second throughput on a single key-value store server platform , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[32]  Christina Delimitrou,et al.  Quasar: resource-efficient and QoS-aware cluster management , 2014, ASPLOS.

[33]  Lingjia Tang,et al.  Whare-map: heterogeneity in "homogeneous" warehouse-scale computers , 2013, ISCA.

[34]  Debbie Marr,et al.  Hyper-Threading Technology in the NetburstTM Microarchitecture , 2013 .

[35]  David A. Patterson,et al.  In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[36]  Herbert Bos,et al.  File size distribution on UNIX systems: then and now , 2006, OPSR.

[37]  Kunle Olukotun,et al.  Maximizing CMP throughput with mediocre cores , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[38]  Babak Falsafi,et al.  From embedded multi-core SoCs to scale-out processors , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).