A robust, high-accuracy speech recognition system for mobile applications

This paper describes a robust, accurate, efficient, low-resource, medium-vocabulary, grammar-based speech recognition system using hidden Markov models for mobile applications. Among the issues and techniques we explore are improving the robustness and efficiency of the front-end, using multiple microphones to remove extraneous signals from speech via a new multichannel CDCN (codeword-dependent cepstral normalization) technique, reducing computation via silence detection, applying the Bayesian information criterion (BIC) to build smaller and better acoustic models, minimizing finite-state grammars, using hybrid maximum-likelihood and discriminative models, and automatically generating baseforms from single new-word utterances.
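The abstract mentions reducing computation via silence detection but gives no algorithmic detail. As a hedged illustration (not the paper's actual method), a minimal energy-based scheme flags frames well below the peak short-time energy as silence so the decoder can skip them; the frame size, hop, and margin below are illustrative assumptions.

```python
import numpy as np

def frame_energies(signal, frame_len=256, hop=128):
    # short-time log energy (dB) per frame, computed over overlapping windows
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.lib.stride_tricks.sliding_window_view(signal, frame_len)[::hop][:n_frames]
    return 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)

def speech_mask(energies, margin_db=15.0):
    # frames within margin_db of the loudest frame are kept as speech;
    # quieter frames are treated as silence and need not be decoded
    return energies > energies.max() - margin_db

rng = np.random.default_rng(1)
sr = 8000
quiet = 0.01 * rng.normal(size=sr)                    # 1 s of low-level noise
loud = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s tone standing in for speech
signal = np.concatenate([quiet, loud, quiet])

mask = speech_mask(frame_energies(signal))
# roughly the middle third of the frames should be flagged as speech
print(mask.mean())
```

Real systems typically replace the fixed margin with an adaptive noise-floor estimate, but the computational saving is the same: only frames in the mask reach the acoustic-model evaluation.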

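The BIC-based model selection the abstract refers to can be sketched in miniature. The example below (an assumption-laden toy, not the paper's implementation) scores a full-covariance versus a diagonal-covariance Gaussian on synthetic feature vectors using BIC = log-likelihood − ½·(number of parameters)·log(number of samples); when the feature dimensions are nearly independent, the cheaper diagonal model wins, which is how BIC drives smaller acoustic models.

```python
import numpy as np

def gaussian_loglik(X, mean, cov):
    # total log-likelihood of rows of X under a multivariate Gaussian
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return np.sum(-0.5 * (d * np.log(2 * np.pi) + logdet + quad))

def bic(loglik, n_params, n_samples):
    # Schwarz's criterion: likelihood minus a complexity penalty; higher is better
    return loglik - 0.5 * n_params * np.log(n_samples)

rng = np.random.default_rng(0)
# synthetic "acoustic features" with independent dimensions
X = rng.normal(size=(500, 4))
n, d = X.shape

mean = X.mean(axis=0)
full_cov = np.cov(X, rowvar=False)
diag_cov = np.diag(np.diag(full_cov))

# parameter counts: d mean entries plus the free covariance entries
bic_full = bic(gaussian_loglik(X, mean, full_cov), d + d * (d + 1) // 2, n)
bic_diag = bic(gaussian_loglik(X, mean, diag_cov), d + d, n)

# the penalty outweighs the full model's tiny likelihood gain on this data
best = 'diagonal' if bic_diag > bic_full else 'full'
print(best)
```

In an acoustic-modeling setting the same score is used to choose, e.g., the number of Gaussians per HMM state, as in the clustering approach of Chen and Gopalakrishnan cited by the authors.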