On the Performance Variability of Production Cloud Services

Cloud computing is an emerging infrastructure paradigm that promises to eliminate the need for companies to maintain expensive computing hardware. Through the use of virtualization and resource time-sharing, clouds address with a single set of physical resources a large user base with diverse needs. Thus, clouds have the potential to provide their owners the benefits of an economy of scale and, at the same time, become an alternative for both the industry and the scientific community to self-owned clusters, grids, and parallel production environments. For this potential to become reality, the first generation of commercial clouds need to be proven to be dependable. In this work we analyze the dependability of cloud services. Towards this end, we analyze long-term performance traces from Amazon Web Services and Google App Engine, currently two of the largest commercial clouds in production. We find that the performance of about half of the cloud services we investigate exhibits yearly and daily patterns, but also that most services have periods of especially stable performance. Last, through trace-based simulation we assess the impact of the variability observed for the studied cloud services on three large-scale applications, job execution in scientific computing, virtual goods trading in social networks, and state management in social gaming. We show that the impact of performance variability depends on the application, and give evidence that performance variability can be an important factor in cloud provider selection.

[1]  Miron Livny,et al.  The Available Capacity of a Privately Owned Workstation Environmont , 1991, Perform. Evaluation.

[2]  John Zahorjan,et al.  Scheduling a mixed interactive and batch workload on a parallel, shared memory supercomputer , 1992, Proceedings Supercomputing '92.

[3]  Andrea C. Arpaci-Dusseau,et al.  The interaction of parallel and sequential workloads on a network of workstations , 1995, SIGMETRICS '95/PERFORMANCE '95.

[4]  Uwe Schwiegelshohn,et al.  Theory and Practice in Parallel Job Scheduling , 1997, JSSPP.

[5]  Andrew Warfield,et al.  Xen and the art of virtualization , 2003, SOSP '03.

[6]  HarrisTim,et al.  Xen and the art of virtualization , 2003 .

[7]  Mark S. Squillante,et al.  Performance Implications of Failures in Large-Scale Cluster Scheduling , 2004, JSSPP.

[8]  Eli M. Dow,et al.  Xen and the Art of Repeated Research , 2004, USENIX Annual Technical Conference, FREENIX Track.

[9]  Ludmila Cherkasova,et al.  Measuring CPU Overhead for I/O Processing in the Xen Virtual Machine Monitor , 2005, USENIX ATC, General Track.

[10]  Willy Zwaenepoel,et al.  Diagnosing performance overheads in the xen virtual machine environment , 2005, VEE '05.

[11]  Kajal T. Claypool,et al.  Latency and player actions in online games , 2006, CACM.

[12]  Matei Ripeanu,et al.  Amazon S3 for science grids: a viable solution? , 2008, DADC '08.

[13]  Alexandru Iosup,et al.  Efficient management of data center resources for Massively Multiplayer Online Games , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[14]  Ashraf Aboulnaga,et al.  Database systems on virtual machines: How much do you lose? , 2008, 2008 IEEE 24th International Conference on Data Engineering Workshop.

[15]  Jeffrey S. Vetter,et al.  Xen-Based HPC: A Parallel I/O Perspective , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).

[16]  Alexandru Iosup,et al.  An Early Performance Analysis of Cloud Computing Services for Scientific Computing , 2008 .

[17]  Edward Walker,et al.  Benchmarking Amazon EC2 for High-Performance Scientific Computing , 2008, login Usenix Mag..

[18]  Alexandru Iosup,et al.  The Grid Workloads Archive , 2008, Future Gener. Comput. Syst..

[19]  M. Livny,et al.  The cost of doing science on the cloud: The Montage example , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[20]  Alexandru Iosup,et al.  The performance of bags-of-tasks in large-scale distributed systems , 2008, HPDC '08.

[21]  Randy H. Katz,et al.  Above the Clouds: A Berkeley View of Cloud Computing , 2009 .

[22]  Kuan-Ta Chen,et al.  Effect of Network Quality on Player Departure Behavior in Online Games , 2009, IEEE Transactions on Parallel and Distributed Systems.

[23]  David Hilley,et al.  Cloud Computing: A Taxonomy of Platform and Infrastructure-level Offerings , 2009 .

[24]  Tal Garfinkel,et al.  Virtual machine contracts for datacenter and cloud computing environments , 2009, ACDC '09.

[25]  Radu Prodan,et al.  A survey and taxonomy of infrastructure as a service and web hosting cloud providers , 2009, 2009 10th IEEE/ACM International Conference on Grid Computing.

[26]  Guillaume Pierre,et al.  EC2 Performance Analysis for Resource Provisioning of Service-Oriented Applications , 2009, ICSOC/ServiceWave Workshops.

[27]  Alexandru Iosup,et al.  C-Meter: A Framework for Performance Analysis of Computing Clouds , 2009, 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid.

[28]  Richard Wolski,et al.  The Eucalyptus Open-Source Cloud-Computing System , 2009, 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid.

[29]  Alexandru Iosup,et al.  Trace-based evaluation of job runtime and queue wait time predictions in grids , 2009, HPDC '09.

[30]  Shantenu Jha,et al.  Exploring the Performance Fluctuations of HPC Workloads on Clouds , 2010, 2010 IEEE Second International Conference on Cloud Computing Technology and Science.

[31]  Alexandru Iosup,et al.  Performance Analysis of Cloud Computing Services for Many-Tasks Scientific Computing , 2011, IEEE Transactions on Parallel and Distributed Systems.