Domain-specific optimizations that exploit particular arithmetic and representation formats have been shown to achieve significant performance and area gains in FPGA hardware designs. In this work, we describe an approach to domain-specific optimization that goes beyond the representation level: we perform a joint optimization from both a high-level mathematical abstraction and a hardware-implementation point of view. We focus on a signal recognition system that distinguishes between spoken digits. We construct transform matrices from Walsh wavelet packets in conjunction with a best-basis algorithm. The resulting transform matrices exhibit a rich algebraic structure with substantial overlap across rows, which creates opportunities for computation reuse in the dot product of the transform matrix with the signal vector. We have developed an algorithm that identifies this reuse and schedules the row computations across the available computation units, significantly reducing the overall amount of computation.
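The reuse-identification step described above can be illustrated with a minimal sketch. The code below is not the paper's algorithm; it is a hypothetical Python illustration of the general idea, assuming the transform rows have ±1 entries (as Walsh-derived matrices do) so that each dot product is a signed sum of signal samples. A greedy common-subexpression pass repeatedly extracts the signed pair of terms shared by the most rows and computes it once as an intermediate, reducing the total number of additions:

```python
from itertools import combinations
from collections import Counter

def cse_dot_products(rows):
    """Greedy pairwise common-subexpression elimination for the dot
    products of several +/-1 rows with one signal vector.

    Each row is a list of (index, sign) terms; its dot product with a
    signal x is sum(sign * x[index]).  Repeatedly pull out the signed
    pair of terms shared by the most rows and replace it with a new
    intermediate index, so that pair's addition is computed only once.
    """
    rows = [list(r) for r in rows]
    intermediates = []  # (new_index, term_a, term_b), in creation order
    next_idx = 1 + max(i for r in rows for i, _ in r)
    while True:
        counts = Counter()
        for r in rows:
            for a, b in combinations(sorted(r), 2):
                counts[(a, b)] += 1
        if not counts:
            break
        (a, b), n = counts.most_common(1)[0]
        if n < 2:
            break  # no pair is shared by two or more rows; stop
        intermediates.append((next_idx, a, b))
        for r in rows:
            if a in r and b in r:
                r.remove(a)
                r.remove(b)
                r.append((next_idx, +1))
        next_idx += 1
    return rows, intermediates

def evaluate(x, rows, intermediates):
    """Evaluate the rewritten dot products against signal vector x."""
    vals = dict(enumerate(x))
    for idx, (ia, sa), (ib, sb) in intermediates:
        vals[idx] = sa * vals[ia] + sb * vals[ib]  # one add per reuse
    return [sum(s * vals[i] for i, s in r) for r in rows]
```

In a hardware setting, each intermediate corresponds to a partial sum computed once and fanned out to every row that consumes it; the scheduling step then assigns rows and intermediates to accumulation units so that an intermediate is ready before its consumers need it.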
We have implemented a custom-built dot-product multiplication unit that exploits computation reuse, targeting a Virtex-II Pro FPGA device. A baseline dot-product unit without reuse achieves a maximum clock rate of 199.3 MHz while utilizing only 2% of the device capacity. The optimized system, which additionally includes a computation scheduler, attains a comparable clock rate of 196 MHz while using 8,183 slices (57%) of the FPGA device. The hardware implementation reduces the amount of computation for an individual matrix by as much as 6.35×, and by 2× on average, for a single pipelined dot-product unit over the baseline. Although larger in area than the baseline, the reuse-exploiting implementation still achieves a 2× computation reduction when compared to three concurrently executing simpler accumulation units occupying the same aggregate FPGA area.
While the results in this paper reflect the opportunities of a specific signal processing problem, this work highlights the broader concept of exploiting computation reuse derived from a higher-level abstract representation, jointly at the mathematical and hardware levels. As such, we believe this approach can also be leveraged in other signal recognition problems with well-characterized computational structures and signal dictionaries.