Paper Information - Design and Implementation of an Efficient Thread Partitioning Algorithm

The development of fine-grain multi-threaded program execution models poses an interesting challenge: how can a program be partitioned into threads that exploit machine parallelism, tolerate latency, and maintain reasonable locality of reference? A successful algorithm must produce a thread partition that makes the best use of the multiple execution units on a single processing node and copes with long, unpredictable latencies. In this paper, we introduce a new thread partitioning algorithm that meets this challenge for a range of machine architecture models. A quantitative affinity heuristic guides the placement of operations into threads and addresses the trade-off between exploiting parallelism and preserving locality. The algorithm is surprisingly simple because it uses a time-ordered event list to account for the activities of the multiple execution units. We have implemented the proposed algorithm, and our experiments on a wide range of examples demonstrate its efficiency and effectiveness.
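
The abstract names two ingredients, an affinity heuristic and a time-ordered event list, without giving their definitions. The toy sketch below only illustrates how such ingredients might fit together; it is not the paper's algorithm. Everything in it is an assumption made for illustration: the function name `partition_threads`, the affinity measure (the fraction of an operation's predecessors already placed in a thread), the `min_affinity` threshold, and the simple model in which at most `num_units` operations run at once.

```python
import heapq
from collections import defaultdict

def partition_threads(latency, preds, num_units=2, min_affinity=0.5):
    """Toy partitioner (hypothetical, not the paper's method).

    latency: dict op -> execution time
    preds:   dict op -> list of predecessor ops (a DAG)
    Returns (thread_of, finish_time).
    """
    succs = defaultdict(list)
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)
    remaining = {op: len(preds.get(op, ())) for op in latency}

    thread_of = {}        # op -> thread id
    thread_free_at = []   # thread id -> time its last op completes
    ready = sorted(op for op, n in remaining.items() if n == 0)
    events = []           # time-ordered event list: (finish_time, op)
    busy_units = 0
    now = 0

    def affinity(op, tid):
        # Assumed measure: share of op's predecessors living in thread tid.
        ps = preds.get(op, ())
        if not ps:
            return 0.0
        return sum(thread_of[p] == tid for p in ps) / len(ps)

    def issue():
        nonlocal busy_units
        while ready and busy_units < num_units:
            op = ready.pop(0)
            # Only idle threads are candidates; joining a busy thread
            # would serialize the op and give up parallelism.
            idle = [t for t, free in enumerate(thread_free_at) if free <= now]
            best = max(idle, key=lambda t: affinity(op, t), default=None)
            if best is None or affinity(op, best) < min_affinity:
                best = len(thread_free_at)   # open a new thread
                thread_free_at.append(now)
            thread_of[op] = best
            thread_free_at[best] = now + latency[op]
            heapq.heappush(events, (now + latency[op], op))
            busy_units += 1

    issue()
    while events:
        now, op = heapq.heappop(events)      # next completion in time order
        busy_units -= 1
        for s in succs[op]:
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)
        issue()
    return thread_of, now

# Diamond-shaped dependence graph: a -> b, a -> c, (b, c) -> d
latency = {"a": 1, "b": 2, "c": 2, "d": 1}
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
threads, finish = partition_threads(latency, preds, num_units=2)
```

In this example the two independent operations `b` and `c` land in different threads so they can overlap, while `d` rejoins a thread that already holds one of its predecessors. With `num_units=1` every operation ends up in a single thread, since parallelism offers nothing and locality dominates.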
