Communication optimizations for global multi-threaded instruction scheduling

The recent shift in the industry towards chip multiprocessor (CMP) designs has brought the need for multi-threaded applications to mainstream computing. As observed in several limit studies, most of the parallelization opportunities require looking for parallelism beyond local regions of code. To exploit these opportunities, especially for sequential applications, researchers have recently proposed global multi-threaded instruction scheduling techniques, including DSWP and GREMIO. These techniques simultaneously schedule instructions from large regions of code, such as arbitrary loop nests or whole procedures, and have been shown to be effective at extracting threads for many applications. A key enabler of these global instruction scheduling techniques is the Multi-Threaded Code Generation (MTCG) algorithm proposed in [16], which generates multi-threaded code for any partition of the instructions into threads. This algorithm inserts communication and synchronization instructions in order to satisfy all inter-thread dependences. In this paper, we present a general compiler framework, COCO, to optimize the communication and synchronization instructions inserted by the MTCG algorithm. This framework, based on thread-aware data-flow analyses and graph min-cut algorithms, appropriately models and optimizes all kinds of inter-thread dependences, including register, memory, and control dependences. Our experiments, using a fully automatic compiler implementation of these techniques, demonstrate significant reductions (about 30% on average) in the number of dynamic communication instructions in code parallelized with DSWP and GREMIO. This reduction in communication translates to performance gains of up to 40%.

[1] Vivek Sarkar, et al. Space-time scheduling of instruction-level parallelism on a raw machine, 1998, ASPLOS VIII.

[2] E. Ayguade, et al. Modulo scheduling with integrated register spilling for clustered VLIW architectures, 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[3] Wen-mei W. Hwu, et al. Field-testing IMPACT EPIC research results in Itanium 2, 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004.

[4] Alexandre E. Eichenberger, et al. Effective cluster assignment for modulo scheduling, 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[5] Joe D. Warren, et al. The program dependence graph and its use in optimization, 1984, TOPL.

[6] Guilherme Ottoni, et al. Automatic thread extraction with decoupled software pipelining, 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[7] Anant Agarwal, et al. Scalar operand networks, 2005, IEEE Transactions on Parallel and Distributed Systems.

[8] Easwaran Raman, et al. A framework for unrestricted whole-program optimization, 2006, PLDI '06.

[9] David I. August, et al. Chip multi-processor scalability for single-threaded applications, 2005, CARN.

[10] D. R. Fulkerson, et al. Flows in Networks, 1964.

[11] Gurindar S. Sohi, et al. Master/Slave Speculative Parallelization, 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings.

[12] Nikil D. Dutt, et al. Partitioned register files for VLIWs: a preliminary analysis of tradeoffs, 1992, MICRO 25.

[13] Guilherme Ottoni, et al. Support for High-Frequency Streaming in CMPs, 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[14] James R. Larus, et al. Static branch frequency and program profile analysis, 1994, MICRO 27.

[15] Jong-Deok Choi, et al. Global communication analysis and optimization, 1996, PLDI '96.

[16] Guilherme Ottoni, et al. Global Multi-Threaded Instruction Scheduling, 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[17] David I. August, et al. Rapid Development of a Flexible Validated Processor Model, 2004.

[18] Monica S. Lam, et al. Limits of control flow on parallelism, 1992, ISCA '92.

[19] R. K. Shyamasundar, et al. Introduction to algorithms, 1996.

[20] Hong-Seok Kim, et al. Bottom-Up and Top-Down Context-Sensitive Summary-Based Pointer Analysis, 2004, SAS.

[21] Bernhard Steffen, et al. Lazy code motion, 1992, PLDI '92.

[22] Matthew K. Farrens, et al. Code Partitioning in Decoupled Compilers, 2000, Euro-Par.

[23] Monica S. Lam, et al. Communication optimization and code generation for distributed memory machines, 1993, PLDI '93.

[24] Mahmut T. Kandemir, et al. A global communication optimization technique based on data-flow analysis and linear algebra, 1999, TOPL.

[25] Easwaran Raman, et al. Speculative Decoupled Software Pipelining, 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[26] Vivek Sarkar, et al. A Concurrent Execution Semantics for Parallel Program Graphs and Program Dependence Graphs, 1992, LCPC.

[27] Saman P. Amarasinghe, et al. Convergent scheduling, 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings.

[28] David I. August, et al. Decoupled software pipelining with the synchronization array, 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004.

[29] Antonia Zhai, et al. Compiler optimization of scalar value communication between speculative threads, 2002, ASPLOS X.

[30] David S. Johnson, et al. Computers and Intractability: A Guide to the Theory of NP-Completeness, 1978.