Domain-Specific Optimization of Signal Recognition Targeting FPGAs

Domain-specific optimizations of matrix computations that exploit specific arithmetic and matrix representation formats have achieved significant performance/area gains in Field-Programmable Gate Array (FPGA) hardware designs. In this article, we apply data-driven optimizations that reduce both storage and computation requirements to the problem of recognizing signals from a known dictionary. Starting from a high-level mathematical representation of a signal recognition problem, we optimize across the layers of the system, exploiting mathematical structure to improve implementation efficiency. Specifically, we use Walsh wavelet packets in conjunction with a BestBasis algorithm to distinguish between spoken digits. The resulting transform matrices are quite sparse and exhibit a rich algebraic structure with significant overlap across rows. As a consequence, the dot products of transform-matrix rows with signal vectors exhibit significant computation reuse, that is, repeated identical computations. We present an algorithm for identifying this computation reuse and scheduling the row computations. We exploit this reuse to derive FPGA hardware implementations that reduce the amount of computation by as much as 6.35× for an individual matrix and by 2× on average for a single dot-product unit. The implementation that exploits reuse achieves a 2× computation reduction compared to three concurrently executing simpler accumulator units with the same aggregate design area, and it outperforms software implementations on high-end desktop personal computers.
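To make the notion of computation reuse concrete, the following is a minimal Python sketch, not the article's algorithm. It assumes each sparse row is stored as a tuple of hypothetical (column, sign) terms with entries of +1 or -1, and it reuses a partial sum only when two rows share a common prefix of terms. The function name `dot_with_prefix_reuse` and the sample data are illustrative assumptions.

```python
def dot_with_prefix_reuse(rows, x):
    """Compute one dot product per sparse row, reusing shared prefix sums.

    rows: list of tuples of (column, sign) terms, with sign in {+1, -1}.
    x:    signal vector (list of numbers).
    Returns (results, additions), where additions counts accumulate steps.
    """
    cache = {}       # maps a prefix of terms to its already-computed partial sum
    additions = 0
    results = []
    for terms in rows:
        # Find the longest prefix of this row whose partial sum is cached.
        k = len(terms)
        while k > 0 and terms[:k] not in cache:
            k -= 1
        acc = cache[terms[:k]] if k > 0 else 0
        # Accumulate only the remaining terms, caching each new prefix.
        for j in range(k, len(terms)):
            col, sign = terms[j]
            acc += sign * x[col]
            additions += 1
            cache[terms[:j + 1]] = acc
        results.append(acc)
    return results, additions


# Illustrative data (assumed, not from the article): three rows overlap.
rows = [
    ((0, 1), (1, 1), (2, 1), (3, 1)),
    ((0, 1), (1, 1), (2, -1), (3, -1)),
    ((0, 1), (1, 1)),
    ((4, 1), (5, -1)),
]
x = [1, 2, 3, 4, 5, 6]

results, additions = dot_with_prefix_reuse(rows, x)
naive_additions = sum(len(terms) for terms in rows)
print(results, additions, naive_additions)
```

In this toy case the rows that begin with the same two terms share one partial sum, so fewer accumulate steps are needed than computing every row independently. A hardware scheduler would also have to decide the order of row computations and how many partial sums to keep, which this sketch ignores.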
