Reducing deadline miss rate for grid workloads running in virtual machines: a deadline-aware and adaptive approach

This thesis explores three major areas of research: integration of virtualization into scientific grid infrastructures, evaluation of the impact of virtualization overhead on the performance of HPC grid jobs, and optimization of job execution to increase throughput by reducing the job deadline miss rate. Integrating virtualization into the grid so that virtual machines are deployed on demand for jobs, in a way that is transparent to end users and has minimal impact on the existing system, poses a significant challenge. It involves creating virtual machines, decompressing the operating system image, adapting the virtual environment to satisfy the job's software requirements, continuously updating the job state once it is running without modifying the batch system or existing grid middleware, and finally returning the host machine to a consistent state. To facilitate this research, an existing pilot job framework already in production was modified to deploy virtual machines on demand on the grid, using the virtualization administrative domain to handle all I/O in order to increase network throughput. This approach limits the impact of change on the existing grid infrastructure while leveraging the execution and performance isolation capabilities of virtualization for job execution. This work led to an evaluation of the scheduling strategies used by the Xen hypervisor, measuring the sensitivity of job performance to the amount of CPU and memory allocated under various configurations. However, virtualization overhead is also a critical factor in determining job execution times. Grid jobs have diverse requirements for machine resources such as CPU, memory, and network, and they depend on other jobs to meet their deadlines, since the input of one job can be the output of a previous job. A novel resource provisioning model was devised to decrease the impact of virtualization overhead on job execution.
Finally, dynamic deadline-aware optimization algorithms were introduced that use exponential smoothing and rate limiting to predict job failure rates based on static and dynamic virtualization overhead. Statistical techniques were also integrated into the optimization algorithm to flag jobs at risk of missing their deadlines and to take preventive action to increase overall job throughput.
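As a rough illustration of how exponential smoothing and rate limiting might be combined to track a deadline miss rate, the following Python sketch is offered under stated assumptions. It is not the algorithm developed in the thesis: the function names, the smoothing factor `alpha`, the step bound `max_step`, and the overhead-inflated projection in `at_risk` are all hypothetical choices made for illustration only.

```python
def smooth(previous, observation, alpha=0.3):
    """Simple exponential smoothing: blend a new observation into the estimate."""
    return alpha * observation + (1.0 - alpha) * previous


def rate_limit(previous, proposed, max_step=0.05):
    """Clamp the change between successive estimates to at most max_step."""
    delta = max(-max_step, min(max_step, proposed - previous))
    return previous + delta


def predict_miss_rate(outcomes, alpha=0.3, max_step=0.05, initial=0.0):
    """Return the smoothed, rate-limited miss-rate estimate after each outcome.

    outcomes: iterable of 1 (deadline missed) or 0 (deadline met).
    alpha, max_step, and initial are illustrative defaults, not thesis values.
    """
    estimate = initial
    history = []
    for missed in outcomes:
        proposed = smooth(estimate, float(missed), alpha)
        estimate = rate_limit(estimate, proposed, max_step)
        history.append(estimate)
    return history


def at_risk(elapsed, remaining_estimate, overhead_fraction, deadline):
    """Flag a job whose projected finish time exceeds its deadline.

    The remaining run time is inflated by an assumed virtualization
    overhead fraction (for example 0.2 for 20 percent).
    """
    projected = elapsed + remaining_estimate * (1.0 + overhead_fraction)
    return projected > deadline
```

In this sketch the rate limiter keeps a short burst of missed deadlines from swinging the estimate too quickly, while the smoothing factor controls how much weight recent outcomes carry. A scheduler could call `at_risk` periodically and take preventive action for flagged jobs, which is one plausible reading of the approach described above.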
