Scalable architecture for word HMM-based speech recognition and VLSI implementation in complete system

This paper describes a scalable architecture for real-time speech recognizers based on word hidden Markov models (HMMs) that provide high recognition accuracy for word recognition tasks. However, the size of their recognition vocabulary is small because its extremely high computational costs cause long processing times. To achieve high-speed operations, we developed a VLSI system that has a scalable architecture. The architecture effectively uses parallel computations on the word HMM structure. It can reduce processing time and/or extend the word vocabulary. To explore the practicality of our architecture, we designed and evaluated a complete system recognizer, including speech analysis and noise robustness parts, on a 0.18-/spl mu/m CMOS standard cell library and field-programmable gate array. In the CMOS standard-cell implementation, the total processing time is 56.9 /spl mu/s/word at an operating frequency of 80 MHz in a single system. The recognizer gives a real-time response using an 800-word vocabulary.

[1]  Magne Hallstein Johnsen,et al.  A VLSI implementation of PDF computations in HMM based speech recognition , 1996, Proceedings of Digital Processing Applications (TENCON '96).

[2]  Mark J. F. Gales,et al.  Use of Gaussian selection in large vocabulary continuous speech recognition using HMMS , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[3]  S. Itahashi A Japanese language speech database , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4]  Kiyohiro Shikano,et al.  A new phonetic tied-mixture model for efficient decoding , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[5]  S. Boll,et al.  Suppression of acoustic noise in speech using spectral subtraction , 1979 .

[6]  Koichi Shinoda,et al.  Speech recognition using tree-structured probability density function , 1994, ICSLP.

[7]  Wonyong Sung,et al.  A statistical model-based voice activity detection , 1999, IEEE Signal Processing Letters.

[8]  Hermann Ney,et al.  Data driven search organization for continuous speech recognition , 1992, IEEE Trans. Signal Process..

[9]  Fabian Vargas,et al.  A FPGA-based Viterbi algorithm implementation for speech recognition systems , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[10]  Satoshi Takahashi,et al.  On the use of scalar quantization for fast HMM computation , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[11]  L. R. Rabiner,et al.  Recognition of isolated digits using hidden Markov models with continuous mixture densities , 1985, AT&T Technical Journal.

[12]  Oliver Chiu-sing Choy,et al.  An HMM-based speech recognition IC , 2003, Proceedings of the 2003 International Symposium on Circuits and Systems, 2003. ISCAS '03..

[13]  Steven F. Quigley,et al.  Implementing a simple continuous speech recognition system on an FPGA , 2002, Proceedings. 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines.

[14]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[15]  Kiyohiro Shikano,et al.  Gaussian mixture selection using context-independent HMM , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[16]  Peter Beyerlein,et al.  Fast log-likelihood computation for mixture densities in a high-dimensional feature space , 1994, ICSLP.

[17]  Stephen A. Zahorian,et al.  Signal modeling for isolated word recognition , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[18]  Hynek Hermansky,et al.  RASTA processing of speech , 1994, IEEE Trans. Speech Audio Process..

[19]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[20]  Alex Acero,et al.  Spoken Language Processing , 2001 .