Self-Tuning Virtual Machines for Predictable eScience

Unpredictable access to batch-mode HPC resources is a significant problem for emerging dynamic data-driven applications. Efforts such as reservations and queue-time prediction have tried to address this problem in part. However, approaches based strictly on space-sharing impose fundamental limits on real-time predictability. In contrast, our earlier work investigated feedback-controlled virtual machines (VMs), a time-sharing approach, as a way to deliver predictable execution. That work did not fully address usability or implementation efficiency. This paper presents an online, software-only version of the feedback-controlled VM, called the self-tuning VM, which we argue is a practical approach to predictable HPC infrastructure. We evaluated it with five widely used applications, and the results show that our approach is both predictable and practical. Simply by running time-dependent jobs with our tool, we typically meet a job's deadline within 3% error, and within 8% error for the more challenging applications.
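The abstract does not give the control law. The following is therefore only an illustrative Python sketch, not the authors' implementation. It rests on three assumptions. First, the job's total work is known in CPU-seconds. Second, the VM's CPU cap can be reset once per control interval. Third, interference from co-located load shows up as an unobserved efficiency factor. The controller estimates that factor from observed progress, which is a simple self-tuning scheme, and then sets the cap to the rate needed to finish on time. The functions `run_job`, `interference` and `deadline_error` are hypothetical. So are the load profile and the definition of deadline error as |finish - deadline| / deadline.

```python
"""Toy self-tuning CPU-cap loop (hypothetical; not the paper's controller).

Assumptions: the job's total work is known in CPU-seconds at a full CPU,
the VM's CPU cap can be reset once per control interval, and interference
from co-located load appears as an unobserved "efficiency" factor.
"""


def interference(t):
    """Hypothetical load profile: the job gets full use of its share for
    the first 50 s, then only 60% of it because of a noisy neighbour."""
    return 1.0 if t < 50.0 else 0.6


def run_job(total_work, deadline, efficiency, adaptive=True, dt=1.0,
            cap_min=0.05, cap_max=1.0):
    """Simulate one job and return its finish time in seconds.

    With adaptive=False the cap stays at total_work / deadline (the share
    that would suffice with no interference). With adaptive=True the cap is
    recomputed every interval from the remaining work, the remaining time
    and an online estimate of the efficiency factor.
    """
    done = 0.0                      # useful CPU-seconds completed so far
    t = 0.0                         # wall-clock time in seconds
    eff_est = 1.0                   # initial guess: no interference
    cap = total_work / deadline     # static share used when adaptive=False
    while done < total_work - 1e-9:
        if adaptive:
            remaining_time = max(deadline - t, dt)
            needed_rate = (total_work - done) / remaining_time
            # Inflate the request by the estimated efficiency, then clamp.
            cap = min(max(needed_rate / eff_est, cap_min), cap_max)
        rate = cap * efficiency(t)  # useful work per second actually obtained
        step = rate * dt
        if done + step >= total_work:
            t += (total_work - done) / rate  # job ends inside this interval
            done = total_work
        else:
            done += step
            t += dt
            # Self-tuning step: re-estimate efficiency from observed progress.
            eff_est = step / (cap * dt)
    return t


def deadline_error(finish, deadline):
    """Relative deadline error, assumed here to be |finish - deadline| / deadline."""
    return abs(finish - deadline) / deadline


if __name__ == "__main__":
    for adaptive in (False, True):
        finish = run_job(50.0, 100.0, interference, adaptive=adaptive)
        label = "self-tuning" if adaptive else "static cap"
        print(f"{label}: finished at {finish:.1f} s, "
              f"deadline error {100 * deadline_error(finish, 100.0):.1f}%")
```

Under this made-up profile, the static cap finishes about a third past the 100 s deadline. The self-tuning loop needs one interval to notice the drop in efficiency, then raises the cap and finishes at the deadline. A real system would also face noisy progress measurements. It would also face a cap that saturates at `cap_max`, in which case no controller can recover the deadline. Those effects are what a more careful controller design has to handle.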