Design and Implementation of an Efficient Thread Partitioning Algorithm

The development of fine-grain multi-threaded program execution models has created an interesting challenge: how can a program be partitioned into threads that exploit machine parallelism, tolerate latency, and maintain reasonable locality of reference? A successful algorithm must produce a thread partition that makes the best use of multiple execution units on a single processing node and copes with long and unpredictable latencies. In this paper, we introduce a new thread partitioning algorithm that meets this challenge for a range of machine architecture models. A quantitative affinity heuristic guides the placement of operations into threads and addresses the trade-off between exploiting parallelism and preserving locality. The algorithm is surprisingly simple because it uses a time-ordered event list to account for the activities of the multiple execution units. We have implemented the proposed algorithm, and our experiments on a wide range of examples demonstrate its efficiency and effectiveness.
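To make the two ingredients named above concrete, the following is a minimal sketch of how a time-ordered event list and an affinity score might interact. It is not the algorithm of the paper: the function name `partition_threads`, its parameters, and the affinity measure (the number of an operation's predecessors already placed in a thread) are assumptions chosen only for illustration, and the paper's quantitative heuristic may differ substantially.

```python
import heapq
from collections import defaultdict


def partition_threads(ops, preds, latency, num_units):
    """Greedy sketch: place operations of a DAG into sequential threads.

    ops       -- operation names in topological order (hypothetical input format)
    preds     -- dict mapping an op to the list of ops it depends on
    latency   -- dict mapping an op to its execution time
    num_units -- number of execution units (must be >= 1)

    Returns (thread_of, finish_time).
    """
    succs = defaultdict(list)
    pending = {}
    for op in ops:
        pending[op] = len(preds.get(op, ()))
        for p in preds.get(op, ()):
            succs[p].append(op)

    ready = [op for op in ops if pending[op] == 0]
    events = []            # time-ordered event list of (finish_time, op)
    thread_of = {}         # op -> thread id
    busy_threads = set()   # threads whose latest op is still executing
    num_threads = 0
    free_units = num_units
    now = 0

    while ready or events:
        # Issue ready operations while an execution unit is free.
        while ready and free_units > 0:
            op = ready.pop(0)
            # Assumed affinity: predecessors already placed in each thread.
            score = defaultdict(int)
            for p in preds.get(op, ()):
                score[thread_of[p]] += 1
            candidates = [t for t in score if t not in busy_threads]
            if candidates:
                # Preserve locality: join the idle thread with highest affinity
                # (ties go to the lowest thread id).
                t = max(candidates, key=lambda c: (score[c], -c))
            else:
                # Exploit parallelism: start a new thread.
                t = num_threads
                num_threads += 1
            thread_of[op] = t
            busy_threads.add(t)
            free_units -= 1
            heapq.heappush(events, (now + latency[op], op))

        # Advance simulated time to the next completion event.
        now, done = heapq.heappop(events)
        busy_threads.discard(thread_of[done])
        free_units += 1
        for s in succs[done]:
            pending[s] -= 1
            if pending[s] == 0:
                ready.append(s)

    return thread_of, now


# Diamond-shaped dependence graph: a -> {b, c} -> d.
ops = ["a", "b", "c", "d"]
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
latency = {"a": 1, "b": 2, "c": 2, "d": 1}

two_units = partition_threads(ops, preds, latency, num_units=2)
one_unit = partition_threads(ops, preds, latency, num_units=1)
```

With two execution units the sketch places `c` in a second thread so that it can overlap with `b`, and the whole graph finishes at time 4. With one unit every operation stays in a single thread and the graph finishes at time 6. This is the parallelism versus locality trade-off that the affinity heuristic is meant to arbitrate.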
