Enhancing loop buffering of media and telecommunications applications using low-overhead predication

Media- and telecommunications-focused processors, increasingly designed as deeply pipelined, statically scheduled VLIWs, rely on loop buffers for low-overhead execution of simple loops. Key loops containing control flow pose a substantial problem: full predication has a high encoding overhead, and partial predication techniques do not support if-conversion, the transformation of general acyclic control flow into predicated blocks. Using a set of significant media processing benchmarks, drawn from MediaBench and contemporary telecommunications standards, we explore a compromise approach. We demonstrate a compiler that uses if-conversion and specialized loop transformations to arrange for 70-99% of fetched operations to come from a simple, statically managed 256-instruction loop buffer, saving instruction fetch power and eliminating branch penalties. To complement this, we introduce a "niche" form of predication that is specialized to permit general if-conversion with only a single bit in the encoding of each operation and that eliminates much of the hardware overhead of a predicate register-based approach.
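As a rough illustration of what if-conversion does to a loop body, the sketch below models it in software. It is not the paper's compiler or its single-bit predication scheme; the function names and the `guarded` helper are hypothetical, and the helper only mimics the effect of a predicated operation (the result is committed when the predicate is true and discarded otherwise).

```python
def sum_abs_branchy(samples):
    """Original loop body: an if/else diamond, i.e. a branch in every iteration."""
    acc, negatives = 0, 0
    for x in samples:
        if x < 0:
            acc -= x
            negatives += 1
        else:
            acc += x
    return acc, negatives


def guarded(pred, new_value, old_value):
    """Hypothetical model of a predicated operation: commit only if pred is true.

    In hardware a nullified operation has no effect; here both values are
    computed by the caller and one of them is simply selected.
    """
    return new_value if pred else old_value


def sum_abs_if_converted(samples):
    """If-converted loop body: one straight-line block, no branch inside the loop."""
    acc, negatives = 0, 0
    for x in samples:
        p = x < 0      # predicate define
        q = not p      # complementary predicate
        acc = guarded(p, acc - x, acc)                   # former 'then' arm
        negatives = guarded(p, negatives + 1, negatives)
        acc = guarded(q, acc + x, acc)                   # former 'else' arm
    return acc, negatives
```

Both functions compute the same result, but the second has a branch-free body, which is the property that lets such a loop run entirely out of a simple loop buffer.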
