Using processor affinity in loop scheduling on shared-memory multiprocessors

Loops are the single largest source of parallelism in many applications. One way to exploit this parallelism is to execute loop iterations in parallel on different processors. Previous approaches to loop scheduling attempted to achieve the minimum completion time by distributing the workload as evenly as possible while minimizing the number of synchronization operations required. The authors consider a third dimension to the problem of loop scheduling on shared-memory multiprocessors: communication overhead caused by accesses to nonlocal data. They show that traditional algorithms for loop scheduling, which ignore the location of data when assigning iterations to processors, incur a significant performance penalty on modern shared-memory multiprocessors. They propose a new loop scheduling algorithm that attempts to simultaneously balance the workload, minimize synchronization, and co-locate loop iterations with the necessary data. They compare the performance of this new algorithm to other known algorithms by using five representative kernel programs on a Silicon Graphics multiprocessor workstation, a BBN Butterfly, a Sequent Symmetry, and a KSR-1, and show that the new algorithm offers substantial performance improvements, up to a factor of 4 in some cases. The authors conclude that loop scheduling algorithms for shared-memory multiprocessors cannot afford to ignore the location of data, particularly in light of the increasing disparity between processor and memory speeds.
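
The abstract names three goals (balanced load, few synchronization operations, iterations placed near their data) without giving the mechanism. The sketch below is one plausible affinity-style policy, offered as an illustration under stated assumptions and not as the authors' exact algorithm. It assumes each processor owns a queue holding a contiguous block of iterations, takes a fraction 1/k of what remains in its own queue, and, only when that queue is empty, takes a fraction 1/P from the most loaded queue. The names `partition`, `next_chunk`, and the parameter `k` are hypothetical.

```python
from math import ceil


def partition(n_iters, n_procs):
    """Split iterations 0..n_iters-1 into contiguous per-processor blocks.

    Assumption: a processor that always starts from the same block tends to
    find that block's data already in its cache or local memory.
    """
    size = ceil(n_iters / n_procs)
    return [list(range(p * size, min((p + 1) * size, n_iters)))
            for p in range(n_procs)]


def next_chunk(queues, me, k):
    """Pick the next chunk of iterations for processor `me`.

    Returns (owner, iterations), or None when every queue is empty.
    In a real multiprocessor each queue would be protected by its own lock;
    this single-threaded sketch omits locking.
    """
    if queues[me]:
        # Common case: local work, touching only this processor's queue.
        owner, divisor = me, k
    else:
        # Local queue exhausted: take from the most loaded processor.
        owner = max(range(len(queues)), key=lambda p: len(queues[p]))
        if not queues[owner]:
            return None
        divisor = len(queues)
    take = ceil(len(queues[owner]) / divisor)
    chunk = queues[owner][:take]
    del queues[owner][:take]
    return owner, chunk
```

Under these assumptions, synchronization stays mostly local because a processor contends for a remote queue only after its own is empty, and load imbalance is corrected late in the loop by migrating a small share of the remaining iterations.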
