Paper Information - A Transformation Framework for Optimizing Task-Parallel Programs

A Transformation Framework for Optimizing Task-Parallel Programs

Task parallelism has increasingly become a trend with programming models such as OpenMP 3.0, Cilk, Java Concurrency, X10, Chapel and Habanero-Java (HJ) to address the requirements of multicore programmers. While task parallelism increases productivity by allowing the programmer to express multiple levels of parallelism, it can also lead to performance degradation due to increased overheads. In this article, we introduce a transformation framework for optimizing task-parallel programs with a focus on task creation and task termination operations. These operations can appear explicitly in constructs such as async, finish in X10 and HJ, task, taskwait in OpenMP 3.0, and spawn, sync in Cilk, or implicitly in composite code statements such as foreach and ateach loops in X10, forall and foreach loops in HJ, and parallel loop in OpenMP. Our framework includes a definition of data dependence in task-parallel programs, a happens-before analysis algorithm, and a range of program transformations for optimizing task parallelism. Broadly, our transformations cover three different but interrelated optimizations: (1) finish-elimination, (2) forall-coarsening, and (3) loop-chunking. Finish-elimination removes redundant task termination operations, forall-coarsening replaces expensive task creation and termination operations with more efficient synchronization operations, and loop-chunking extracts useful parallelism from ideal parallelism. All three optimizations are specified in an iterative transformation framework that applies a sequence of relevant transformations until a fixed point is reached. Further, we discuss the impact of exception semantics on the specified transformations, and extend them to handle task-parallel programs with precise exception semantics. 
Experimental results were obtained for a collection of task-parallel benchmarks on three multicore platforms: a dual-socket 128-thread (16-core) Niagara T2 system, a quad-socket 16-core Intel Xeon SMP, and a quad-socket 32-core Power7 SMP. We have observed that the proposed optimizations interact with each other in a synergistic way, and result in an overall geometric average performance improvement between 6.28× and 10.30×, measured across all three platforms for the benchmarks studied.
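
To make the loop-chunking idea concrete, here is a minimal sketch in Python. It is not the paper's HJ implementation. It uses a standard-library thread pool as a stand-in for task creation, and the names `square`, `run_unchunked`, and `run_chunked` are hypothetical. It contrasts creating one task per iteration with creating one task per contiguous chunk of iterations, which yields the same results with far fewer task creations.

```python
from concurrent.futures import ThreadPoolExecutor


def square(i):
    """Stand-in for the loop body; any independent per-iteration work fits."""
    return i * i


def run_unchunked(n, workers=4):
    """One task per iteration: n task creations for n iterations."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Leaving the `with` block waits for all submitted tasks,
        # loosely analogous to a `finish` scope around `async` tasks.
        futures = [pool.submit(square, i) for i in range(n)]
    return [f.result() for f in futures]


def run_chunked(n, workers=4):
    """One task per contiguous chunk: at most `workers` task creations."""
    chunk = max(1, -(-n // workers))  # ceiling division, at least 1

    def body(lo):
        # Each task runs its chunk of iterations sequentially.
        return [square(i) for i in range(lo, min(lo + chunk, n))]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(body, lo) for lo in range(0, n, chunk)]
    # Chunks were submitted in index order, so flattening preserves order.
    return [x for f in futures for x in f.result()]
```

With `n = 10` and `workers = 4`, the chunked version creates 4 tasks instead of 10. This sketch assumes the iterations are independent and contain no synchronization. Chunking loops that do contain synchronization needs more care than this shows.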
