Paper information - Enhancing loop buffering of media and telecommunications applications using low-overhead predication

Media- and telecommunications-focused processors, increasingly designed as deeply pipelined, statically scheduled VLIWs, rely on loop buffers for low-overhead execution of simple loops. Key loops containing control flow pose a substantial problem: full predication has a high encoding overhead, and partial predication techniques do not support if-conversion, the transformation of general acyclic control flow into predicated blocks. Using a set of significant media processing benchmarks, drawn from MediaBench and contemporary telecommunications standards, we explore a compromise approach. We demonstrate a compiler that uses if-conversion and specialized loop transformations to arrange for 70-99% of fetched operations to come from a simple, statically managed 256-instruction loop buffer, saving instruction fetch power and eliminating branch penalties. To complement this, we introduce a "niche" form of predication specialized to permit general if-conversion with only a single bit in the encoding of each operation and to eliminate much of the hardware overhead of a predicate register-based approach.
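As a rough illustration of what if-conversion means, the sketch below rewrites a loop body containing an if/else into straight-line code in which every operation carries a guard. It is only a toy model: the function names (`branchy`, `if_converted`, `guarded`) and the example computation are hypothetical, and the sketch does not model the paper's single-bit encoding, its compiler, or the loop buffer hardware.

```python
def guarded(pred, op, old):
    """Model of a predicated operation: run op() if pred holds, else keep old."""
    return op() if pred else old


def branchy(xs, limit):
    """Loop body with control flow: each iteration takes one of two paths."""
    out = []
    for x in xs:
        if x > limit:
            y = limit
        else:
            y = x * 2
        out.append(y)
    return out


def if_converted(xs, limit):
    """Same loop after if-conversion: no branch inside the body.

    Both arms are issued in sequence; the guard decides which one
    actually updates y. The body is a single straight-line block.
    """
    out = []
    for x in xs:
        p = x > limit                              # predicate definition
        y = None
        y = guarded(p, lambda: limit, y)           # "then" arm, guarded by p
        y = guarded(not p, lambda: x * 2, y)       # "else" arm, guarded by not p
        out.append(y)
    return out


if __name__ == "__main__":
    data = [1, 5, 10]
    print(branchy(data, 6))
    print(if_converted(data, 6))
```

Both functions return the same results for the same input; the difference is that the if-converted body contains no internal branch, which is the property that lets such a loop be held in a simple loop buffer.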

Wen-mei W. Hwu | Hillery C. Hunter | John W. Sias
