10-millisecond Computing

Although computation is becoming far more complex and operates on data of unprecedented scale, we argue that computers and smart devices should, and will, consistently deliver information and knowledge to human beings within a few tens of milliseconds. We coin the term 10-millisecond computing to call attention to this class of workloads. Public reports indicate that internet service users are sensitive to service- or job-level response time outliers, so we propose a simple but powerful metric, the outlier proportion, to characterize system behavior. The outlier proportion is defined as follows: for N completed requests or jobs, if M of them have latencies exceeding the outlier limit t (e.g., 10 milliseconds), the outlier proportion is M/N. 10-millisecond computing raises many challenges for both software and hardware stacks. In this paper, as a case study, we investigate the challenges it raises for conventional operating systems. For typical latency-critical services running on Linux on a 40-core server, a mainstream server hardware configuration in the near future, we find that as the outlier limit decreases, the outlier proportion of a single server deteriorates significantly. Moreover, the outlier proportion is further amplified by the system scale, including the number of cores. For a 1K-scale system running Linux (version 2.6.32), LXC (version 0.7.5), or XEN (version 4.0.0), we surprisingly find that to reduce the service- or job-level outlier proportion to 10%, the outlier proportion of a single server must be reduced by 871X, 2372X, and 2372X, respectively. We also conduct a series of experiments showing that current Linux systems still suffer from poor outlier performance, including Linux kernel version 3.17.4, Linux kernel version 2.6.35M (a modified version of 2.6.35 integrated with sloppy counters), and two representative real-time schedulers.
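The metric and its amplification with scale can be illustrated with a minimal sketch. This is our own illustration, not the paper's measurement methodology: the function names, the assumption that a request fans out to all servers and completes only when the slowest one responds, and the independence of per-server latencies are assumptions made here purely to show how a 10% job-level target translates into a far stricter per-server outlier budget on a 1K-node system.

```python
# Minimal sketch (assumptions noted above): outlier proportion M/N and the
# per-server budget implied by a job-level target under independent fan-out.

def outlier_proportion(latencies_ms, outlier_limit_ms=10.0):
    """Fraction M/N of requests whose latency exceeds the outlier limit t."""
    n = len(latencies_ms)
    m = sum(1 for latency in latencies_ms if latency > outlier_limit_ms)
    return m / n if n else 0.0

def required_single_server_proportion(target_job_level, num_servers):
    """Single-server outlier proportion p such that the job-level proportion
    1 - (1 - p)**num_servers equals target_job_level, assuming each job waits
    for all num_servers servers and their latencies are independent."""
    return 1.0 - (1.0 - target_job_level) ** (1.0 / num_servers)

if __name__ == "__main__":
    # Outlier proportion for a handful of example latencies with t = 10 ms.
    print(outlier_proportion([3.1, 8.7, 12.4, 5.0, 27.9]))   # 0.4
    # Per-server budget for a 10% job-level target on 1,000 servers:
    # roughly 1.05e-4, i.e. orders of magnitude stricter than the job-level goal.
    print(required_single_server_proportion(0.10, 1000))
```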
