Complementing software pipelining with software thread integration

Software pipelining is a critical optimization for producing efficient code for VLIW/EPIC and superscalar processors in high-performance embedded applications such as digital signal processing. Software thread integration (STI) can often improve the performance of looping code in cases where software pipelining performs poorly or fails. This paper examines both situations, presenting methods to determine what and when to integrate. We evaluate our methods on C-language image and digital signal processing libraries and synthetic loop kernels, compiling them for a very long instruction word (VLIW) digital signal processor (DSP), the Texas Instruments (TI) C64x architecture. Loops that benefit little from software pipelining (SWP-Poor) speed up by 26% (harmonic mean, HM). Loops for which software pipelining fails (SWP-Fail) due to conditionals and calls speed up by 16% (HM). Combining SWP-Good and SWP-Poor loops leads to a speedup of 55% (HM).
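As a rough illustration of the idea, the following C sketch integrates two independent loops by hand. It is not taken from the paper's libraries or tools: the functions `iir`, `scale`, and `iir_scale` are hypothetical, the arrays are assumed not to overlap, and the values are assumed small enough to avoid integer overflow. The first loop carries a dependence from one iteration to the next, which is the kind of loop that may leave issue slots idle after software pipelining; the second loop is independent work that could fill them.

```c
#include <stddef.h>

/* Hypothetical thread A: first-order recurrence. Each iteration needs the
 * previous result, so the loop-carried dependence limits overlap. */
void iir(int *y, const int *x, size_t n, int a)
{
    int prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev = x[i] + (a * prev) / 256;
        y[i] = prev;
    }
}

/* Hypothetical thread B: independent element-wise scaling. */
void scale(int *out, const int *in, size_t m, int k)
{
    for (size_t i = 0; i < m; i++)
        out[i] = in[i] * k;
}

/* Hand-integrated version: both loop bodies share one loop for the common
 * trip count, then clean-up loops finish whichever thread has iterations
 * left. Assumes y, x, out, and in do not alias each other. */
void iir_scale(int *y, const int *x, size_t n, int a,
               int *out, const int *in, size_t m, int k)
{
    size_t common = (n < m) ? n : m;
    int prev = 0;
    size_t i;

    for (i = 0; i < common; i++) {
        prev = x[i] + (a * prev) / 256;   /* body of thread A */
        y[i] = prev;
        out[i] = in[i] * k;               /* body of thread B */
    }
    for (size_t j = i; j < n; j++) {      /* remaining iterations of A */
        prev = x[j] + (a * prev) / 256;
        y[j] = prev;
    }
    for (size_t j = i; j < m; j++)        /* remaining iterations of B */
        out[j] = in[j] * k;
}
```

Calling `iir_scale` should produce the same results as calling `iir` and `scale` separately. Whether the integrated loop actually runs faster depends on the target compiler and on the resources each loop body uses, which is the question the methods summarized above address.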
