Energy performance of FPGAs on PERFECT suite kernels

Energy efficiency has gained significant importance as a metric in recent years. The goal of the TAPAS project is to discover and develop revolutionary approaches for generating energy-efficient designs on embedded computing systems. The FPGA is a suitable target platform for implementing energy-efficient algorithms because of its low power consumption, high performance and efficiency, and high degree of programmability. We develop optimizations targeted at reducing memory and communication energy dissipation on FPGAs. Applying these memory and communication optimizations, we design parameterized architectures for six kernels selected from the PERFECT benchmark suite. Our optimized designs demonstrate a 3x to 110x improvement in energy efficiency compared to baseline implementations.
