Automatic thread extraction with decoupled software pipelining

Until recently, a steadily rising clock rate and other uniprocessor microarchitectural improvements could be relied upon to consistently deliver increasing performance for a wide range of applications. Current difficulties in maintaining this trend have led microprocessor manufacturers to add value by incorporating multiple processors on a chip. Unfortunately, since decades of compiler research have not succeeded in delivering automatic threading for prevalent code properties, this approach demonstrates no improvement for a large class of existing codes. To find useful work for chip multiprocessors, we propose an automatic approach to thread extraction, called decoupled software pipelining (DSWP). DSWP exploits the fine-grained pipeline parallelism lurking in most applications to extract long-running, concurrently executing threads. Use of the nonspeculative and truly decoupled threads produced by DSWP can increase execution efficiency and provide significant latency tolerance, mitigating design complexity by reducing intercore communication and per-core resource requirements. Using our initial fully automatic compiler implementation and a validated processor model, we prove the concept by demonstrating significant gains for dual-core chip multiprocessor models running a variety of codes. We then explore simple opportunities missed by our initial compiler implementation, which suggest a promising future for this approach.
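
To make the idea of pipeline parallelism between decoupled threads concrete, the following is a minimal hand-written sketch, not output of the DSWP compiler and not the paper's algorithm. It assumes a simple linked-list loop and splits it by hand into a traversal stage and a computation stage that communicate in one direction through a bounded queue. All names here (`Node`, `build_list`, `pipelined_sum_of_squares`, `queue_size`) are hypothetical, and a software `queue.Queue` merely stands in for whatever intercore communication mechanism a real system would provide. Python threads are used only to show the structure; they are not expected to yield a speedup.

```python
import queue
import threading

_DONE = object()  # illustrative sentinel marking the end of the stream


class Node:
    """Singly linked list node (hypothetical example data structure)."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def build_list(values):
    """Build a linked list holding `values` in order; returns the head or None."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def sequential_sum_of_squares(head):
    """Original single-threaded loop: pointer chasing and work are interleaved."""
    total = 0
    node = head
    while node is not None:
        total += node.value * node.value
        node = node.next
    return total


def pipelined_sum_of_squares(head, queue_size=32):
    """Same loop split into two stages connected by a one-way bounded queue."""
    q = queue.Queue(maxsize=queue_size)
    result = []

    def traverse():
        # Stage 1: owns the pointer-chasing recurrence (node = node.next).
        node = head
        while node is not None:
            q.put(node.value)  # blocks only when the queue is full
            node = node.next
        q.put(_DONE)

    def compute():
        # Stage 2: consumes values; never sends anything back to stage 1.
        total = 0
        while True:
            item = q.get()
            if item is _DONE:
                break
            total += item * item
        result.append(total)

    stages = [threading.Thread(target=traverse), threading.Thread(target=compute)]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return result[0]
```

In this sketch the traversal stage can run ahead of the computation stage by up to `queue_size` items, which is one way to picture the latency tolerance described above: a stall in either stage need not immediately stall the other. Because data flows only from the first stage to the second, neither thread speculates and no work is ever rolled back.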
