Signal modeling for high-performance robust isolated word recognition

This paper describes speech signal modeling techniques that are well suited to high-performance, robust isolated word recognition. We present new techniques for incorporating spectral/temporal information as a function of the temporal position within each word. In particular, spectral/temporal parameters are computed using both variable-length blocks and variable spacing between blocks. We tested features computed with these methods on an alphabet recognition task based on the ISOLET database. The Hidden Markov Model Toolkit (HTK) was used to implement the isolated word recognizer with whole-word HMMs. The best accuracy achieved for speaker-independent alphabet recognition, using 50 features, was 97.9%, which represents a new benchmark for this task. We also tested these methods under deliberate signal degradation, using additive Gaussian noise and telephone band limiting, and found that recognition accuracy degrades gracefully and to a smaller degree than for control cases based on MFCCs and delta cepstral terms.
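The idea of blocks whose length and spacing depend on position within the word can be illustrated with a minimal sketch. The abstract does not state how the block length or spacing varies, so the linear interpolation below, the cosine expansion over time within each block, and all function and parameter names (`block_layout`, `dct_over_time`, `block_features`, `first_len`, `last_step`, and so on) are assumptions made purely for illustration, not the authors' actual procedure.

```python
import math

def block_layout(n_frames, first_len, last_len, first_step, last_step):
    """Return (start, length) blocks over a word of n_frames frames.

    ASSUMPTION: block length and block spacing are interpolated linearly
    from their word-initial to their word-final values.
    """
    blocks = []
    start = 0
    while start < n_frames:
        pos = start / max(n_frames - 1, 1)  # 0.0 at word start, 1.0 at word end
        length = round(first_len + pos * (last_len - first_len))
        step = max(1, round(first_step + pos * (last_step - first_step)))
        length = max(1, min(length, n_frames - start))  # truncate at word end
        blocks.append((start, length))
        start += step
    return blocks

def dct_over_time(track, n_terms):
    """Low-order cosine expansion of one parameter's trajectory in a block."""
    n = len(track)
    return [
        sum(x * math.cos(math.pi * k * (t + 0.5) / n) for t, x in enumerate(track)) / n
        for k in range(n_terms)
    ]

def block_features(frames, blocks, n_terms):
    """One feature vector per block: n_terms time-expansion terms per spectral parameter."""
    n_params = len(frames[0])
    features = []
    for start, length in blocks:
        segment = frames[start:start + length]
        vec = []
        for c in range(n_params):
            vec.extend(dct_over_time([f[c] for f in segment], n_terms))
        features.append(vec)
    return features

# Hypothetical usage: 10 frames of 2 spectral parameters each.
frames = [[float(t), 1.0] for t in range(10)]
blocks = block_layout(len(frames), first_len=2, last_len=4, first_step=1, last_step=3)
features = block_features(frames, blocks, n_terms=2)
```

With these hypothetical settings, blocks near the start of the word are short and closely spaced, while later blocks are longer and further apart, so temporal resolution varies with position in the word.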
