PocketSUMMIT: small-footprint continuous speech recognition

We present PocketSUMMIT, a small-footprint version of our SUMMIT continuous speech recognition system. As portable devices become smaller and more powerful, speech is becoming an increasingly important input modality for them. PocketSUMMIT is implemented as a variable-rate continuous-density hidden Markov model (HMM) with diphone context-dependent models. We explore various Gaussian parameter quantization schemes and find that compression of 8:1 or more is achievable with little reduction in accuracy. We also show how the quantized parameters can be used for rapid table lookup. To further reduce memory usage, we explore first-pass language model pruning in a finite-state transducer (FST) framework, as well as FST and n-gram weight quantization and bit packing. PocketSUMMIT can currently run a moderate-vocabulary conversational speech recognition system in real time in a few MB on current PDAs and smart phones.

Index Terms: speech recognition, small footprint, parameter quantization, finite-state transducer
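The abstract does not spell out the quantization scheme, so the following is only a minimal sketch of the general idea, not the PocketSUMMIT implementation. It assumes diagonal-covariance Gaussians, a separate scalar codebook per feature dimension trained with a simple Lloyd iteration, and illustrative bit widths (5 bits per mean, 3 bits per inverse variance, so one byte replaces two 32-bit floats). All function names and parameters here are hypothetical. The sketch also shows why quantized parameters allow table lookup: for each frame, the term (x - mean)^2 * inverse_variance is computed once per codeword pair, and each Gaussian's log-likelihood becomes a sum of table entries.

```python
import math

def nearest(codebook, value):
    """Index of the codeword closest to value."""
    return min(range(len(codebook)), key=lambda k: abs(codebook[k] - value))

def train_codebook(values, bits, iters=20):
    """Scalar Lloyd quantizer with 2**bits levels (illustrative, not tuned)."""
    levels = 2 ** bits
    lo, hi = min(values), max(values)
    # Start from uniformly spaced codewords over the observed range.
    codebook = [lo + (hi - lo) * (k + 0.5) / levels for k in range(levels)]
    for _ in range(iters):
        sums = [0.0] * levels
        counts = [0] * levels
        for v in values:
            k = nearest(codebook, v)
            sums[k] += v
            counts[k] += 1
        # Move each codeword to the mean of its cell; keep empty cells as-is.
        codebook = [sums[k] / counts[k] if counts[k] else codebook[k]
                    for k in range(levels)]
    return codebook

def quantize_model(means, variances, mean_bits=5, var_bits=3):
    """Replace each mean and inverse variance by a small codebook index."""
    dims = len(means[0])
    mean_cb = [train_codebook([m[d] for m in means], mean_bits)
               for d in range(dims)]
    ivar_cb = [train_codebook([1.0 / v[d] for v in variances], var_bits)
               for d in range(dims)]
    mean_idx = [[nearest(mean_cb[d], m[d]) for d in range(dims)]
                for m in means]
    ivar_idx = [[nearest(ivar_cb[d], 1.0 / v[d]) for d in range(dims)]
                for v in variances]
    # Per-Gaussian normalization constant from the quantized variances.
    consts = [-0.5 * sum(math.log(2.0 * math.pi / ivar_cb[d][iv[d]])
                         for d in range(dims))
              for iv in ivar_idx]
    return mean_cb, ivar_cb, mean_idx, ivar_idx, consts

def frame_table(x, mean_cb, ivar_cb):
    """table[d][m][v] = (x[d] - mean codeword m)^2 * inverse-variance codeword v."""
    return [[[(x[d] - m) ** 2 * iv for iv in ivar_cb[d]]
             for m in mean_cb[d]]
            for d in range(len(x))]

def log_likelihood(g, table, mean_idx, ivar_idx, consts):
    """Log-likelihood of Gaussian g using only lookups and additions."""
    total = sum(table[d][mean_idx[g][d]][ivar_idx[g][d]]
                for d in range(len(table)))
    return consts[g] - 0.5 * total
```

With these assumed bit widths, the per-frame table has 32 x 8 entries per dimension regardless of how many Gaussians the model contains, so the multiplications are shared across all Gaussians and the per-Gaussian cost is one lookup and one addition per dimension.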
