Distributed resource management across process boundaries

Multi-tenant distributed systems composed of small services, such as Service-oriented Architectures (SOAs) and Micro-services, raise new challenges in attaining high performance and efficient resource utilization. In these systems, a request execution spans tens to thousands of processes, and the execution paths and resource demands on different services are generally not known when a request first enters the system. In this paper, we highlight the fundamental challenges of regulating load and scheduling in SOAs while meeting end-to-end performance objectives on metrics of concern to both tenants and operators. We design Wisp, a framework for building SOAs that transparently adapts rate limiters and request schedulers system-wide according to operator policies to satisfy end-to-end goals while responding to changing system conditions. In evaluations against production as well as synthetic workloads, Wisp successfully enforces a range of end-to-end performance objectives, such as reducing average latencies, meeting deadlines, providing fairness and isolation, and avoiding system overload.

Florin Ciucu | Marco Canini | Ishai Menache | Peter Bodík | Lalith Suresh

[1] Xiaohui Gu,et al. AGILE: Elastic Distributed Resource Scaling for Infrastructure-as-a-Service , 2013, ICAC.

[2] Scott Shenker,et al. Adaptive Stream Processing using Dynamic Batch Sizing , 2014, SoCC.

[3] Benjamin Hindman,et al. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types , 2011, NSDI.

[4] Scott Shenker,et al. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling , 2010, EuroSys '10.

[5] Nick McKeown,et al. pFabric: minimal near-optimal datacenter transport , 2013, SIGCOMM.

[6] Ju Wang,et al. Windows Azure Storage: a highly available cloud storage service with strong consistency , 2011, SOSP.

[7] Hitesh Ballani,et al. End-to-end Performance Isolation Through Virtual Datacenters , 2014, OSDI.

[8] Srikanth Kandula,et al. Multi-resource packing for cluster schedulers , 2014, SIGCOMM.

[9] I. Stoica,et al. FairCloud: sharing the network in cloud computing , 2011, CCRV.

[10] Peter J. Varman,et al. Balancing fairness and efficiency in tiered storage systems with bottleneck-aware allocation , 2014, FAST.

[11] David E. Culler,et al. Overload management as a fundamental service design primitive , 2002, EW 10.

[12] Peter R. Pietzuch,et al. THEMIS: Fairness in Federated Stream Processing under Overload , 2016, SIGMOD Conference.

[13] Wei Jin,et al. USENIX Association Proceedings of USITS '03: 4th USENIX Symposium on Internet Technologies and Systems , 2003 .

[14] Mor Harchol-Balter,et al. Self-adaptive admission control policies for resource-sharing systems , 2009, SIGMETRICS '09.

[15] Boon Thau Loo,et al. Automated profiling and resource management of pig programs for meeting service level objectives , 2012, ICAC '12.

[16] Gautam Kumar,et al. Hold 'em or fold 'em?: aggregation queries under performance variations , 2016, EuroSys.

[17] Emmanuel Grolleau. Introduction to Real‐Time Scheduling , 2014 .

[18] Nick McKeown,et al. Why flow-completion time is the right metric for congestion control , 2006 .

[19] Jignesh M. Patel,et al. Twitter Heron: Stream Processing at Scale , 2015, SIGMOD Conference.

[20] Antony I. T. Rowstron,et al. IOFlow: a software-defined storage architecture , 2013, SOSP.

[21] Justine Sherry,et al. Silo: Predictable Message Latency in the Cloud , 2015, Comput. Commun. Rev..

[22] Tony Tung,et al. Scaling Memcache at Facebook , 2013, NSDI.

[23] Peter J. Varman,et al. mClock: Handling Throughput Variability for Hypervisor IO Scheduling , 2010, OSDI.

[24] Jörg Widmer,et al. TCP-Friendly Multicast Congestion Control (TFMCC): Protocol Specification , 2006, RFC.

[25] Abhishek Verma,et al. Large-scale cluster management at Google with Borg , 2015, EuroSys.

[26] Rodrigo Fonseca,et al. Retro: Targeted Resource Management in Multi-tenant Distributed Systems , 2015, NSDI.

[27] Wei Lin,et al. Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing , 2014, OSDI.

[28] Sam Newman,et al. Building Microservices , 2015 .

[29] Marco Spuri,et al. Deadline Scheduling for Real-Time Systems: Edf and Related Algorithms , 2013 .

[30] Antony I. T. Rowstron,et al. Better never than late: meeting deadlines in datacenter networks , 2011, SIGCOMM.

[31] Donald F. Towsley,et al. Performance evaluation of two new disk scheduling algorithms for real-time systems , 2004, Real-Time Systems.

[32] Sriram Ramabhadran,et al. Cloud control with distributed rate limiting , 2007, SIGCOMM '07.

[33] Brighten Godfrey,et al. Finishing flows quickly with preemptive scheduling , 2012, CCRV.

[34] Amin Vahdat,et al. BwE: Flexible, Hierarchical Bandwidth Allocation for WAN Distributed Computing , 2015, Comput. Commun. Rev..

[35] Jun Li,et al. Wormhole: Reliable Pub-Sub to Support Geo-replicated Internet Services , 2015, NSDI.

[36] Michael I. Jordan,et al. Managing data transfers in computer clusters with orchestra , 2011, SIGCOMM.

[37] Yingwei Luo,et al. Failure Recovery: When the Cure Is Worse Than the Disease , 2013, HotOS.

[38] Hitesh Ballani,et al. Decentralized task-aware scheduling for data center networks , 2015, SIGCOMM 2015.

[39] Thomas Erl,et al. Service-Oriented Architecture: Concepts, Technology, and Design , 2005 .

[40] Mor Harchol-Balter,et al. Connection Scheduling in Web Servers , 1999, USENIX Symposium on Internet Technologies and Systems.

[41] Luiz André Barroso,et al. The tail at scale , 2013, CACM.

[42] Sujata Banerjee,et al. ElasticSwitch: practical work-conserving bandwidth guarantees for cloud computing , 2013, SIGCOMM.

[43] Evgenia Smirni,et al. AWAIT: Efficient Overload Management for Busy Multi-tier Web Services under Bursty Workloads , 2010, ICWE.

[44] George Varghese,et al. Efficient fair queueing using deficit round robin , 1995, SIGCOMM '95.

[45] Ion Stoica,et al. Efficient coflow scheduling with Varys , 2014, SIGCOMM.

[46] Randy H. Katz,et al. Cake: enabling high-level SLOs on shared storage systems , 2012, SoCC '12.

[47] Nick McKeown,et al. Why flow-completion time is the right metric for congestion control , 2006, CCRV.

[48] Srikanth Kandula,et al. Speeding up distributed request-response workflows , 2013, SIGCOMM.

[49] Zhenhuan Gong,et al. PRESS: PRedictive Elastic ReSource Scaling for cloud systems , 2010, 2010 International Conference on Network and Service Management.

[50] Anees Shaikh,et al. Performance Isolation and Fairness for Multi-Tenant Cloud Storage , 2012, OSDI.

[51] Srikanth Kandula,et al. Multi-resource packing for cluster schedulers , 2015, SIGCOMM.

[52] Eric A. Brewer,et al. Lessons from Giant-Scale Services , 2001, IEEE Internet Comput..

[53] Irfan Ahmad,et al. PARDA: Proportional Allocation of Resources for Distributed Storage Access , 2009, FAST.

[54] Robert N. M. Watson,et al. Queues Don't Matter When You Can JUMP Them! , 2015, NSDI.

[55] Werner Vogels,et al. Dynamo: amazon's highly available key-value store , 2007, SOSP.

[56] Rodrigo Fonseca,et al. 2DFQ: Two-Dimensional Fair Queuing for Multi-Tenant Cloud Services , 2016, SIGCOMM.