Feedback-controlled resource sharing for predictable eScience

The emerging class of adaptive, real-time, data-driven applications poses a significant problem for today's HPC systems. In general, it is extremely difficult for HPC resources controlled by a queuing system to make and guarantee a tightly bounded prediction of when a newly submitted application will execute. While a reservation-based approach partially addresses the problem, it can create severe resource under-utilization (unused reservations, necessary scheduled idle slots, underutilized reservations, etc.) that resource providers are eager to avoid. In contrast, this paper presents a fundamentally different approach to guaranteeing predictable execution. By creating a virtualized application layer called the performance container, and opportunistically multiplexing concurrent performance containers through the application of formal feedback control theory, we regulate each job's progress so that it meets its deadline without requiring exclusive access to resources, even in the presence of a wide class of unexpected disturbances. Our evaluation using two widely used applications, WRF and BLAST, on an 8-core server shows that our approach is predictable and meets deadlines with an average error of 3.4% while achieving high overall utilization.
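To make the idea of regulating a job's progress toward a deadline concrete, the following is a minimal sketch of a single feedback loop. It is not the controller used in the paper: the integral control law, the gain, the `capacity` disturbance model, and all names and numbers are illustrative assumptions. At each control period the loop measures the job's progress rate, compares it with the rate needed to finish exactly at the deadline, and adjusts the job's resource share accordingly.

```python
def capacity(t: float) -> float:
    """Hypothetical disturbance: useful work per unit of share drops
    from 1.0 to 0.6 after t = 50 s (e.g. contention from another job)."""
    return 1.0 if t < 50.0 else 0.6


def simulate(work: float = 100.0, deadline: float = 200.0,
             dt: float = 1.0, gain: float = 0.5) -> float:
    """Return the simulated completion time of one job.

    work     -- total work units the job must complete (assumed known)
    deadline -- target completion time in seconds
    dt       -- control period in seconds
    gain     -- integral gain; gain = 0 disables feedback (fixed share)
    """
    share = 0.5   # initial resource share: exactly enough if undisturbed
    done = 0.0    # work completed so far
    t = 0.0
    while done < work:
        # Progress rate observed during this control period.
        rate = share * capacity(t)
        done += rate * dt
        t += dt
        # Rate needed from now on to finish exactly at the deadline.
        time_left = max(deadline - t, dt)
        target = (work - done) / time_left
        # Integral control: nudge the share in proportion to the rate
        # error, keeping it within an assumed [0.05, 1.0] range.
        share = min(1.0, max(0.05, share + gain * (target - rate)))
    return t


if __name__ == "__main__":
    print("with feedback:   ", simulate(gain=0.5))
    print("without feedback:", simulate(gain=0.0))
```

In this toy setting the controlled job finishes close to its 200 s deadline despite the slowdown, whereas the fixed-share run overshoots it by a wide margin. A real system would also have to arbitrate among several such loops competing for the same cores, which this sketch does not attempt.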
