Symphony: A Scheduler for Client-Server Applications on Coprocessor-Based Heterogeneous Clusters

Coprocessors such as GPUs are increasingly being deployed in clusters to process scientific and compute-intensive jobs. In this work, we study whether GPU-based heterogeneous clusters can benefit client-server applications. Specifically, we consider the practical situation where multiple client-server applications share a heterogeneous cluster (multi-tenancy) and experience unpredictable variations in incoming client request rates, including steep load spikes. Even for "compute-intensive" client-server applications, it is unclear whether a GPU-based cluster can seamlessly deliver acceptable response times in the presence of multi-tenancy and load spikes. We argue that this situation requires a cluster-level scheduler that is aware of application load, request deadlines, and cluster heterogeneity. We propose a novel scheduler called Symphony that enables efficient, dynamic sharing of a GPU-based heterogeneous cluster across multiple concurrently executing client-server applications, each subject to arbitrary load spikes. Symphony performs three key tasks: it (i) monitors the load on each application, (ii) collects past performance data and dynamically builds simple performance models of the available processing resources, and (iii) computes a priority for each pending request based on these parameters and the request's slack. Using these priorities, it reorders client requests across different applications to achieve acceptable response times. We also define how client-server applications should interact with a scheduler such as Symphony, and develop an API to this end. We deploy Symphony as user-space middleware on a high-end heterogeneous cluster with dual quad-core Xeon CPUs and dual NVIDIA Fermi GPUs.
An evaluation using representative applications shows that, in the presence of load spikes, (i) 2-20x fewer requests miss their response time constraints under Symphony than under other schedulers, and (ii) other schedulers need 2x more cluster nodes to achieve the same performance as Symphony.
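The three tasks named in the abstract (load monitoring, history-based performance models, and slack-based priorities) can be illustrated with a minimal Python sketch. This is not Symphony's actual algorithm or API, which the abstract does not specify. The names `Request`, `PerfModel`, `slack`, and `reorder`, the moving-average model, and the "least slack first, busier application wins ties" rule are all assumptions made for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical pending client request (fields are illustrative)."""
    app: str         # application the request belongs to
    deadline: float  # absolute time (seconds) by which a response is due
    size: float      # abstract amount of work in the request


class PerfModel:
    """Assumed model: moving average of seconds per work unit,
    kept separately for each (application, resource) pair."""

    def __init__(self, window=32):
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, app, resource, size, elapsed):
        # Store one past observation of how long a request took.
        self.samples[(app, resource)].append(elapsed / size)

    def estimate(self, app, resource, size, default=1.0):
        # Predict run time; fall back to a default rate with no history.
        history = self.samples[(app, resource)]
        rate = sum(history) / len(history) if history else default
        return rate * size


def slack(req, now, model, resource):
    """Time left before the deadline after the predicted run time."""
    return req.deadline - now - model.estimate(req.app, resource, req.size)


def reorder(pending, now, model, resource, load):
    """Order requests by least slack; break ties in favor of the
    application with the higher observed load (requests per second)."""
    return sorted(
        pending,
        key=lambda r: (slack(r, now, model, resource), -load.get(r.app, 0.0)),
    )


if __name__ == "__main__":
    model = PerfModel()
    model.record("search", "gpu", size=8, elapsed=2.0)   # 0.25 s per unit
    model.record("render", "gpu", size=8, elapsed=16.0)  # 2.0 s per unit
    pending = [
        Request("search", deadline=4.0, size=8),   # predicted 2 s
        Request("render", deadline=10.0, size=8),  # predicted 16 s
    ]
    load = {"search": 120.0, "render": 15.0}
    for r in reorder(pending, now=0.0, model=model, resource="gpu", load=load):
        print(r.app, slack(r, 0.0, model, "gpu"))
```

In this sketch the `render` request is served first even though its deadline is later, because its predicted run time on the GPU already exceeds the time remaining. A plain earliest-deadline ordering would not capture this, which is one reason a scheduler might combine deadlines with per-resource performance estimates.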
