Resilient workload manager: taming bursty workload of scaling Internet applications

In data centers hosting scaling Internet applications, operators face a tradeoff between resource efficiency and Quality of Service (QoS), and the root cause lies in workload dynamics. In this paper, we address the problem with the design of the Resilient Workload Manager (ROM). As a comprehensive workload management framework, ROM covers the workload, data, resources, and QoS of the target applications. Its basic idea is to explicitly segregate base workload and trespassing workload, the two naturally different components of application workload, and to manage them separately in two resource zones with specialized optimization techniques. The ROM prototype was implemented as a layer-7 load balancer for a video streaming service testbed, which consists of a local compute cluster serving as the base load zone and the Amazon EC2 infrastructure serving as the trespassing zone.
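The two-zone idea can be illustrated with a minimal sketch. The abstract does not say how ROM decides which requests count as base or trespassing workload, so the rule used here (fill the base zone up to a fixed concurrency limit and send the overflow to the trespassing zone) is purely an assumption for illustration. The names `Zone`, `TwoZoneDispatcher`, and `base_capacity` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """A resource zone with a count of requests currently in service."""
    name: str
    active: int = 0


class TwoZoneDispatcher:
    """Hypothetical dispatcher, not ROM's actual algorithm.

    Assumption: a request is treated as base workload while the base zone
    has spare capacity, and as trespassing workload otherwise.
    """

    def __init__(self, base_capacity: int) -> None:
        if base_capacity < 0:
            raise ValueError("base_capacity must be non-negative")
        self.base_capacity = base_capacity
        self.base = Zone("base")                # e.g. a local compute cluster
        self.trespassing = Zone("trespassing")  # e.g. on-demand cloud capacity

    def dispatch(self) -> Zone:
        """Pick a zone for one incoming request and record it as active."""
        if self.base.active < self.base_capacity:
            zone = self.base
        else:
            zone = self.trespassing
        zone.active += 1
        return zone

    def release(self, zone: Zone) -> None:
        """Mark one request in the given zone as finished."""
        if zone.active <= 0:
            raise ValueError(f"no active requests in zone {zone.name!r}")
        zone.active -= 1


if __name__ == "__main__":
    dispatcher = TwoZoneDispatcher(base_capacity=2)
    placed = [dispatcher.dispatch() for _ in range(3)]
    print([zone.name for zone in placed])
    dispatcher.release(placed[0])
    print(dispatcher.dispatch().name)
```

With a base capacity of 2, the third concurrent request overflows to the trespassing zone; once a base-zone request is released, the next request returns to the base zone. A real layer-7 load balancer would also have to account for data placement and QoS targets, which this sketch leaves out.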
