A specialized low-cost vectorized loop buffer for embedded processors

Loop buffers have so far been explored mainly as an architectural technique for low-power execution in embedded processors. Another avenue, however, is to exploit the loop buffer for its performance benefit. In this paper, we propose an application-specific loop buffer organization for vectorized processing kernels that targets both low power and high performance. The vectorized loop buffer (VLB) is simplified to support a single loop on SIMD devices. Because using SIMD capabilities normally incurs significant data rearrangement overhead, the VLB is specialized to perform implicit data permutation with zero overhead. We extend the baseline ISA with several instructions for programming the VLB and integrate the VLB into an embedded processor for evaluation. Our results show that the VLB significantly improves both performance and power compared to conventional SIMD devices.
