Fast 2D Convolutions and Cross-Correlations Using Scalable Architectures

This manuscript describes fast, scalable architectures and associated algorithms for computing convolutions and cross-correlations. The basic idea is to map 2D convolutions and cross-correlations to a collection of 1D convolutions and cross-correlations in the transform domain. This is accomplished through the discrete periodic Radon transform (DPRT) for general kernels and through singular value decomposition-LU (SVD-LU) decompositions for low-rank kernels. The approach uses scalable architectures that can be fitted into modern FPGA and Zynq-SOC devices. Depending on the types and amounts of available resources, 2D convolutions and cross-correlations of $P\times P$ blocks can be computed in anywhere from $O(P)$ to $O(P^{2})$ clock cycles. Thus, there is a trade-off between performance and the number and types of resources required. We provide implementations of the proposed architectures on modern programmable devices (Virtex-7 and Zynq-SOC). Based on the amounts and types of required resources, we show that the proposed approaches significantly outperform current methods.
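The mapping from one 2D convolution to a collection of 1D convolutions can be illustrated in software. The following is a minimal pure-Python sketch, not the hardware architecture described in the manuscript: it assumes a prime block size N, integer data, and circular (periodic) convolution, and all function names are hypothetical. It takes the DPRT of both inputs, circularly convolves each of the N+1 pairs of projections in 1D, and applies the inverse DPRT.

```python
def dprt(f):
    """Forward DPRT of an N x N integer image (N assumed prime).

    Returns N+1 projections, each of length N.
    """
    n = len(f)
    # Projections m = 0..N-1: sum along the periodic line j = d + m*i (mod N).
    r = [[sum(f[i][(d + m * i) % n] for i in range(n)) for d in range(n)]
         for m in range(n)]
    # Extra projection m = N: the row sums of the image.
    r.append([sum(f[d]) for d in range(n)])
    return r


def idprt(r):
    """Inverse DPRT for prime N; exact for integer input."""
    n = len(r) - 1
    total = sum(r[0])  # every projection sums to the total of the image
    return [[(sum(r[m][(j - m * i) % n] for m in range(n)) - total + r[n][i]) // n
             for j in range(n)]
            for i in range(n)]


def circ_conv_1d(a, b):
    """Direct 1D circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[k] * b[(d - k) % n] for k in range(n)) for d in range(n)]


def circ_conv_2d_dprt(f, g):
    """2D circular convolution computed as N+1 independent 1D convolutions."""
    rf, rg = dprt(f), dprt(g)
    return idprt([circ_conv_1d(pf, pg) for pf, pg in zip(rf, rg)])


def circ_conv_2d_direct(f, g):
    """Reference 2D circular convolution, used only for checking."""
    n = len(f)
    return [[sum(f[a][b] * g[(i - a) % n][(j - b) % n]
                 for a in range(n) for b in range(n))
             for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    n = 5  # prime block size (assumption of this sketch)
    f = [[(3 * i + 2 * j) % 7 for j in range(n)] for i in range(n)]
    g = [[1 if (i < 2 and j < 2) else 0 for j in range(n)] for i in range(n)]
    assert circ_conv_2d_dprt(f, g) == circ_conv_2d_direct(f, g)
```

Because the N+1 one-dimensional convolutions are independent of one another, they can in principle be evaluated in parallel or serially, which is one way to read the resource-versus-cycle-count trade-off mentioned above. To obtain a linear rather than circular convolution, the inputs would presumably be zero-padded to a prime size at least as large as the full output; that padding step is an assumption here and is not shown.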
