Vector FPGA acceleration of 1-D DWT computations using sparse matrix skeletons

We exploit the application-specific sparse structure and distribution of non-zero coefficients in Discrete Wavelet Transform (DWT) matrices to significantly improve the performance of 1-D DWT mapped to FPGA-based soft vector processors. We reformulate DWT computations in terms of sparse matrix operations, where the transformation matrices consist of a repeating block with a fixed non-zero pattern, which we refer to as a skeleton. We use this property to transform the original DWT matrix into a Modified-Matrix-Form that exposes abundant soft vector parallelism in the dot products. The resulting form can also be readily compiled into low-level DMA routines that boost memory throughput. We autogenerate vector routines and memory access sequences tailored to parametric combinations of DWT filter sizes and decomposition levels, as required by the application domain. When compared to embedded ARMv7 32-bit CPU implementations using optimized OpenBLAS routines, soft vector implementations on the Xilinx Zedboard and Altera DE2/DE4 platforms demonstrate speedups of 12-103×.
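The skeleton idea can be illustrated with a minimal NumPy sketch. It is not the authors' Modified-Matrix-Form or their generated vector code. It assumes a single-level, periodic Daubechies-4 transform, and the function names (`db4_filters`, `dwt_matrix`, `dwt_skeleton`) are hypothetical. The first routine builds the full, mostly-zero DWT matrix; the second computes the same result from only the small repeating block, as a batch of short dense dot products.

```python
import numpy as np

def db4_filters():
    """Daubechies-4 low-pass (h) and high-pass (g) analysis filters."""
    s3 = np.sqrt(3.0)
    h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
    g = np.array([h[3], -h[2], h[1], -h[0]])  # quadrature mirror of h
    return h, g

def dwt_matrix(n, h, g):
    """Full n x n single-level periodic DWT matrix (sparse, stored densely)."""
    k = len(h)
    W = np.zeros((n, n))
    for r in range(n // 2):
        cols = (2 * r + np.arange(k)) % n   # same pattern, shifted by 2 per row
        W[r, cols] = h                      # approximation rows
        W[n // 2 + r, cols] = g             # detail rows
    return W

def dwt_skeleton(x, h, g):
    """Same transform using only the 2 x k repeating block (the skeleton)."""
    n, k = len(x), len(h)
    # Gather the k input samples each output pair needs (periodic wrap).
    idx = (2 * np.arange(n // 2)[:, None] + np.arange(k)[None, :]) % n
    windows = x[idx]                        # shape (n/2, k)
    skeleton = np.stack([h, g])             # shape (2, k), the non-zero block
    out = windows @ skeleton.T              # n/2 independent short dot products
    return np.concatenate([out[:, 0], out[:, 1]])

h, g = db4_filters()
x = np.arange(16, dtype=float)
assert np.allclose(dwt_matrix(16, h, g) @ x, dwt_skeleton(x, h, g))
```

In this sketch the gather into `windows` plays the role of a memory access sequence, and the product with `skeleton.T` is the data-parallel part; only `2 * k` coefficients are stored regardless of the signal length.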
