Elimination of Overhead Operations in Complex Loop Structures for Embedded Microprocessors

Looping operations are a significant bottleneck to computational efficiency in embedded applications. This paper describes a novel zero-overhead loop controller (ZOLC) that supports arbitrary loop structures with multiple-entry and multiple-exit nodes, and uses it to enhance embedded RISC processors. A graph formalism is introduced for representing the loop structure of application programs, which can assist in ZOLC code synthesis. A portable description of the ZOLC component, suitable for use in register transfer level (RTL) synthesis, is also given in detail. This description is designed to be retargeted to single-issue RISC processors with minimal effort. The ZOLC unit has been incorporated into different RISC processor models and research ASIPs at different abstraction levels (RTL VHDL and ArchC), providing low-overhead looping without negative impact on the processor cycle time. Average performance improvements of 25.5 percent and 44 percent are feasible for a set of kernel benchmarks on an embedded RISC and an application-specific processor, respectively. A corresponding 10 percent speedup is achieved on the same RISC for a subset of MiBench applications, which do not necessarily feature the examined performance-critical kernels.
