Job Coscheduling on Coupled High-End Computing Systems

Supercomputer centers often deploy large-scale computing systems together with an associated data analysis or visualization system. In this paper, we propose a coscheduling mechanism that coordinates the execution of jobs on different systems. The mechanism is built on a lightweight protocol for coordination between policy domains without manual intervention. We have evaluated this system using real job traces from Intrepid and Eureka, the production Blue Gene/P and data analysis systems, respectively, deployed at Argonne National Laboratory. Our experimental results quantify the costs of coscheduling and demonstrate that coscheduling can be achieved with limited impact on system performance under varying workloads.
