Effective distributed scheduling of parallel workloads

We present a distributed algorithm for time-sharing parallel workloads that is competitive with coscheduling. Implicit scheduling allows each local scheduler in the system to make independent decisions that dynamically coordinate the scheduling of cooperating processes across processors. Of particular importance is the blocking algorithm, which decides what a process does while waiting for a communication or synchronization event to complete. Through simulation of bulk-synchronous parallel applications, we find that a simple two-phase fixed-spin blocking algorithm performs well; a two-phase adaptive algorithm that gathers run-time data on barrier wait times performs slightly better. Our results hold across a range of machine parameters and parallel program characteristics. These findings directly contradict the literature claiming that explicit coscheduling is necessary for fine-grained programs. We also show that the choice of local scheduler is crucial: a priority-based scheduler performs two to three times better than a round-robin scheduler. Overall, we find that the performance of implicit scheduling is near that of coscheduling (+/- 35%), without the requirement of explicit, global coordination.
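
To make the blocking decision concrete, the sketch below (written in Python purely for readability) shows a two-phase wait and one plausible adaptive rule that derives the spin limit from measured barrier wait times. The names event_done and block, the 200-microsecond context-switch cost, and the clamping rule are assumptions introduced here for illustration; they are not details taken from the paper.

    import time

    # Assumed cost of blocking and later rescheduling a process (context switch).
    # The paper varies the real machine parameters in its simulations.
    CONTEXT_SWITCH_COST = 200e-6  # seconds (hypothetical value)

    def two_phase_wait(event_done, spin_limit, block):
        """Phase one: spin-poll the event for up to spin_limit seconds.
        Phase two: if the event is still incomplete, block so the local
        scheduler can run another process."""
        deadline = time.monotonic() + spin_limit
        while time.monotonic() < deadline:
            if event_done():
                return "completed-while-spinning"
        block()
        return "blocked"

    class AdaptiveSpin:
        """Choose the spin limit from observed barrier wait times.
        This clamping rule is only one plausible adaptive policy; the
        paper's exact rule may differ."""
        def __init__(self, floor=2 * CONTEXT_SWITCH_COST,
                     ceiling=10 * CONTEXT_SWITCH_COST):
            self.floor = floor      # spin at least ~2 context-switch costs
            self.ceiling = ceiling  # never spin unboundedly
            self.total = 0.0
            self.count = 0

        def observe(self, wait_time):
            # Record how long the last barrier or communication wait took.
            self.total += wait_time
            self.count += 1

        def spin_limit(self):
            if self.count == 0:
                return self.floor
            mean_wait = self.total / self.count
            # Clamp the mean observed wait into [floor, ceiling].
            return max(self.floor, min(mean_wait, self.ceiling))

Under these assumptions, a waiting process would call two_phase_wait(event_done, policy.spin_limit(), block) at each barrier and feed the measured wait back through observe(); the fixed-spin variant simply passes a constant spin limit (for example, a small multiple of the context-switch cost).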
