Software thread integration for instruction-level parallelism

Multimedia applications demand significantly higher performance than earlier embedded-system workloads. This demand has driven digital signal processor (DSP) makers to adopt high-performance architectures such as VLIW (Very Long Instruction Word). Despite many efforts to exploit instruction-level parallelism (ILP) in applications, the achieved speed is still a fraction of what it could be, limited by the difficulty of finding enough independent instructions to keep all of the processor's functional units busy. This article proposes Software Thread Integration (STI) for instruction-level parallelism. STI is a software technique for interleaving multiple threads of control into a single implicitly multithreaded one. We use STI to improve performance on ILP processors by merging parallel procedures into one, which widens the compiler's scope and allows it to create a more efficient instruction schedule. Assuming the parallel procedures are given, we define a methodology for finding the best-performing integrated procedure with minimum compilation time. We quantitatively estimate the performance impact of integration, so that candidate integration scenarios can be compared and ranked via profitability analysis. During thread integration, different ILP-improving code transformations are applied selectively according to the control structure and ILP characteristics of the code, driven by interactions with software pipelining. The estimated profitability is then verified and corrected by an iterative compilation approach, which compensates for possible estimation inaccuracy. Our modeling methods, combined with limited compilation, quickly find the best integration scenario without requiring exhaustive integration.
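As a rough illustration of what merging two parallel procedures can look like at the source level, the following C sketch jams two independent loops into one. It is not taken from the article: the procedure names (`scale_samples`, `block_energy`, `scale_and_energy`, `sti_selfcheck`), the data types, and the choice of simple loop jamming with clean-up loops are all assumptions made for this example, and the article's actual transformations depend on control structure and software pipelining behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical thread A: scale each sample by a fixed gain. */
void scale_samples(const int16_t *in, int32_t *out, size_t n, int32_t gain)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (int32_t)in[i] * gain;
}

/* Hypothetical thread B: accumulate the energy of a block. */
int64_t block_energy(const int16_t *in, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int64_t)in[i] * in[i];
    return acc;
}

/* Integrated procedure (illustrative): both loop bodies share one loop,
 * so the scheduler sees independent operations from A and B together.
 * Assumes out_a does not alias a or b. Returns thread B's result. */
int64_t scale_and_energy(const int16_t *a, int32_t *out_a, size_t n_a,
                         int32_t gain, const int16_t *b, size_t n_b)
{
    assert(a != NULL && out_a != NULL && b != NULL);

    size_t common = (n_a < n_b) ? n_a : n_b;
    int64_t acc = 0;
    size_t i;

    /* Jammed loop: one iteration of A and one of B per trip. */
    for (i = 0; i < common; i++) {
        out_a[i] = (int32_t)a[i] * gain;
        acc += (int64_t)b[i] * b[i];
    }
    /* Clean-up loops for whichever thread has iterations left. */
    for (size_t j = i; j < n_a; j++)
        out_a[j] = (int32_t)a[j] * gain;
    for (size_t j = i; j < n_b; j++)
        acc += (int64_t)b[j] * b[j];

    return acc;
}

/* Returns 1 if the integrated version matches the separate procedures. */
int sti_selfcheck(void)
{
    const int16_t a[5] = {1, -2, 3, -4, 5};
    const int16_t b[3] = {3, 4, -5};
    int32_t ref[5], out[5];

    scale_samples(a, ref, 5, 3);
    int64_t e_ref = block_energy(b, 3);
    int64_t e_int = scale_and_energy(a, out, 5, 3, b, 3);

    if (e_ref != e_int)
        return 0;
    for (size_t i = 0; i < 5; i++)
        if (ref[i] != out[i])
            return 0;
    return 1;
}
```

Calling `sti_selfcheck()` compares the integrated procedure against the two original ones on a small input with unequal trip counts. Whether the jammed loop actually runs faster on a given VLIW target depends on register pressure, code size, and how the compiler pipelines each version, which is the kind of question a profitability estimate followed by trial compilation is meant to answer.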
