An Ultra Low-Power Hardware Accelerator for Acoustic Scoring in Speech Recognition

Accurate, real-time Automatic Speech Recognition (ASR) comes at a high energy cost, so accuracy often has to be sacrificed to fit the strict power constraints of mobile systems. However, accuracy is extremely important to the end user, and today's systems are still unsatisfactory for many applications. The most critical component of an ASR system is acoustic scoring, as it has a large impact on the accuracy of the system and takes up the bulk of the execution time. The vast majority of ASR systems implement acoustic scoring by means of Gaussian Mixture Models (GMMs), where the acoustic scores are obtained by evaluating multidimensional Gaussian distributions. In this paper, we propose a hardware accelerator for GMM evaluation that reduces the energy required for acoustic scoring by three orders of magnitude compared to solutions based on CPUs and GPUs. Our accelerator implements a lazy evaluation scheme in which Gaussians are computed on demand, avoiding 50% of the computations. Furthermore, it employs a novel clustering scheme to reduce the size of the acoustic model, which yields 8x memory bandwidth savings with a negligible impact on accuracy. Finally, it includes a novel memoization scheme that avoids 74.88% of the floating-point operations. The final design provides a 164x speedup and a 3532x energy reduction compared with a highly tuned implementation running on a modern mobile CPU. Compared with a highly optimized CUDA implementation running on a state-of-the-art mobile GPU, the GMM accelerator achieves a 5.89x speedup while reducing energy by 241x.
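To make the idea of on-demand GMM evaluation concrete, the following Python sketch scores a feature vector against a GMM and computes a state's score only when it is requested. It is an illustration under stated assumptions, not the accelerator's design: the diagonal covariance matrices, the per-frame score cache, and the names `gmm_log_score` and `LazyScorer` are all hypothetical and do not come from the paper.

```python
import math


def gmm_log_score(x, weights, means, variances):
    """Log-likelihood of feature vector x under a GMM.

    Assumption: diagonal covariances, so each component is a product of
    independent one-dimensional Gaussians.
    """
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # log N(x; mu, diag(var)), accumulated over the feature dimensions
        log_gauss = -0.5 * sum(
            math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
            for xi, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_gauss)
    # log-sum-exp over mixture components for numerical stability
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


class LazyScorer:
    """Evaluates a state's GMM only when the search asks for its score.

    Hypothetical helper: `models` maps a state id to a
    (weights, means, variances) tuple.
    """

    def __init__(self, models):
        self.models = models
        self.evaluated = 0      # number of GMMs actually computed
        self._frame = None
        self._cache = {}

    def new_frame(self, x):
        # Scores depend on the current frame, so the cache is reset per frame.
        self._frame = x
        self._cache = {}

    def score(self, state_id):
        if state_id not in self._cache:
            params = self.models[state_id]
            self._cache[state_id] = gmm_log_score(self._frame, *params)
            self.evaluated += 1
        return self._cache[state_id]
```

In this sketch, a decoder that touches only a subset of states in a frame pays only for those states; GMMs that are never requested are never evaluated, which is the kind of saving the lazy evaluation scheme targets.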
