A 40 nm 144 mW VLSI Processor for Real-Time 60-kWord Continuous Speech Recognition

We have developed a low-power VLSI chip for 60-kWord real-time continuous speech recognition based on a context-dependent hidden Markov model (HMM). Our implementation includes a cache architecture that exploits the locality of speech recognition, beam pruning with a dynamic threshold, two-stage language model searching, highly parallel Gaussian mixture model (GMM) computation parallelized at the mixture level, a variable-frame look-ahead scheme, and elastic pipeline operation between the Viterbi transition and GMM processing. The accuracy degradation caused by the important parameters in the Viterbi computation is examined rigorously. Results show that, compared with the reference Julius system, our implementation reduces the required bandwidth by 95% (to 70.86 MB/s) and the required operating frequency by 78% (to 126.5 MHz). The test chip, fabricated in 40 nm CMOS technology, contains 1.9 M transistors for logic and 7.8 Mbit of on-chip memory. It dissipates 144 mW at 126.5 MHz and 1.1 V for 60-kWord real-time continuous speech recognition.
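As an illustration of beam pruning with a dynamic threshold, the following minimal Python sketch keeps only the hypotheses whose log score lies within a fixed beam of the best score, then tightens the threshold if too many hypotheses survive. It is a generic software sketch, not the chip's actual pruning rule, which the abstract does not specify; the function name `prune_beam` and the parameters `beam_width` and `max_active` are assumptions introduced here.

```python
import math


def prune_beam(scores, beam_width, max_active):
    """Prune Viterbi hypotheses for one frame.

    scores      -- dict mapping a state id to its log-likelihood score
    beam_width  -- hypothetical fixed beam, in log-score units
    max_active  -- hypothetical cap on surviving hypotheses

    Returns (survivors, threshold). The threshold is "dynamic" in the
    sense that it is raised when the fixed beam admits more than
    max_active hypotheses.
    """
    if not scores:
        return {}, -math.inf

    # Step 1: ordinary beam pruning relative to the best score.
    best = max(scores.values())
    threshold = best - beam_width
    survivors = {s: v for s, v in scores.items() if v >= threshold}

    # Step 2: if too many hypotheses survive, raise the threshold to the
    # score of the max_active-th best one. Ties at that score are kept,
    # so the survivor count may slightly exceed max_active.
    if len(survivors) > max_active:
        ranked = sorted(survivors.values(), reverse=True)
        threshold = ranked[max_active - 1]
        survivors = {s: v for s, v in survivors.items() if v >= threshold}

    return survivors, threshold


if __name__ == "__main__":
    frame_scores = {"a": -10.0, "b": -12.0, "c": -30.0, "d": -11.0}
    kept, thr = prune_beam(frame_scores, beam_width=5.0, max_active=2)
    print(sorted(kept), thr)
```

A hardware design would more likely estimate the threshold from running statistics than sort the scores; the sort is used here only to keep the sketch short.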
