Predictable High-Performance Computing Using Feedback Control and Admission Control

Historically, batch scheduling has dominated the management of High-Performance Computing (HPC) resources. One of the most significant limitations of this approach is its inability to predict both the start time and the end time of jobs. Although existing research, such as resource reservation and queue-time prediction, partially addresses this issue, a more predictable HPC system is needed, particularly for an emerging class of adaptive real-time HPC applications. This paper presents the design and implementation of a predictable HPC system using feedback control and admission control. By creating a virtualized application layer and opportunistically multiplexing concurrent applications through the application of formal control theory, we regulate a job's progress so that the job meets its deadline without requiring exclusive access to resources, even in the presence of a wide class of unexpected events. Admission control regulates access to resources when they are oversubscribed. Our experimental results using five widely used applications show that the feedback and admission controllers achieve a highly predictable HPC system. The feedback controller regulates an HPC job's progress accurately, close to the prediction by theory, thereby demonstrating the successful application of classic control theory to HPC workloads. In week-long experiments, over 90 percent of jobs met their deadlines, and the jobs that missed their deadlines still finished close to the requested deadlines (12.4 percent error).
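
The interplay between progress regulation and admission control can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the controller designed in the paper: the abstract does not give the control law, so a simple closed-loop rule (recompute the needed resource share from measured progress at every interval) stands in for the formally designed feedback controller. The names `Job`, `required_share`, `admit`, and `run`, the notion of abstract "work units", and the 50 percent slowdown used as the unexpected event are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    work: float          # total work units (hypothetical measure of progress)
    deadline: float      # seconds after submission
    done: float = 0.0    # work units completed so far (the measured output)
    share: float = 0.0   # fraction of the resource currently granted


def required_share(job: Job, now: float, full_rate: float) -> float:
    """Share needed to finish the remaining work exactly at the deadline."""
    remaining_time = job.deadline - now
    if remaining_time <= 0:
        return 1.0
    return min(1.0, (job.work - job.done) / (full_rate * remaining_time))


def admit(running: list, candidate: Job, now: float,
          full_rate: float, capacity: float = 1.0) -> bool:
    """Admission test (assumed policy): accept only if total demand fits."""
    demand = sum(required_share(j, now, full_rate) for j in running)
    return demand + required_share(candidate, now, full_rate) <= capacity


def run(job: Job, full_rate: float = 1.0, dt: float = 1.0,
        slow: tuple = (20.0, 40.0)) -> float:
    """Closed loop: measure progress, then re-derive the share each interval.

    Between slow[0] and slow[1] the job runs at half efficiency, standing in
    for an unexpected event the loop has to absorb.
    """
    now = 0.0
    while job.done < job.work - 1e-9 and now < job.deadline:
        job.share = required_share(job, now, full_rate)
        efficiency = 0.5 if slow[0] <= now < slow[1] else 1.0
        progress = job.share * full_rate * efficiency * dt
        job.done = min(job.work, job.done + progress)
        now += dt
    return now  # time at which the loop stopped


if __name__ == "__main__":
    job = Job(work=50.0, deadline=100.0)
    finish = run(job)
    print(f"finished at t={finish:.0f}, done={job.done:.2f} of {job.work:.0f}")
    print("admit second job:", admit([Job(50.0, 100.0)], Job(40.0, 100.0), 0.0, 1.0))
```

In this toy setting the loop raises the job's share after the slowdown and still completes the work by the deadline, whereas a fixed share of 0.5 would leave 5 of the 50 work units unfinished. The admission test simply refuses a job whose required share would push total demand above capacity; a real system would also have to account for estimation error and for the overhead of multiplexing.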
