A General Method for Combining Acoustic Features in an Automatic Speech Recognition System

A general method for using different types of features in Automatic Speech Recognition (ASR) systems is presented. A Gaussian mixture model (GMM) is obtained in a reference acoustic space. A specific feature combination or selection is associated with each Gaussian of the mixture and used for computing symbol posterior probabilities. Symbols can refer to phonemes, phonemes in context, or states of a Hidden Markov Model (HMM). Experimental results are presented for applications to phoneme and word rescoring after verification. Two corpora were used: one with small vocabularies in Italian and Spanish, and one with a very large vocabulary in French.
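The abstract does not give the combination formula, so the following is only a minimal sketch of one plausible reading: a frame in the reference space is softly assigned to the Gaussians of the mixture, and each Gaussian contributes the symbol posteriors of the feature stream associated with it. The function names, the diagonal-covariance assumption, the toy numbers, and the mixing rule P(q | x) = sum_g P(g | x) P_{s(g)}(q) are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gmm_responsibilities(x, weights, means, variances):
    """Posterior P(g | x) of each diagonal-covariance Gaussian for one frame x."""
    # log N(x; mu_g, diag(var_g)) for every component g
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_joint = np.log(weights) + log_norm + log_exp
    log_joint -= log_joint.max()  # stabilize before exponentiating
    resp = np.exp(log_joint)
    return resp / resp.sum()

def combined_symbol_posteriors(resp, stream_choice, stream_posteriors):
    """Mix per-stream symbol posteriors; each Gaussian uses its assigned stream.

    Assumed rule (not from the paper): P(q | x) = sum_g P(g | x) * P_{s(g)}(q).
    """
    per_gaussian = stream_posteriors[stream_choice]  # (n_gaussians, n_symbols)
    return resp @ per_gaussian

# Toy setup (hypothetical values): 2 Gaussians in a 2-D reference space,
# 2 feature streams, 3 symbols.
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
variances = np.ones((2, 2))
stream_choice = np.array([0, 1])  # Gaussian 0 -> stream 0, Gaussian 1 -> stream 1
stream_posteriors = np.array([[0.7, 0.2, 0.1],   # symbol posteriors from stream 0
                              [0.1, 0.3, 0.6]])  # symbol posteriors from stream 1

x = np.array([0.1, -0.2])  # a frame close to the first Gaussian
resp = gmm_responsibilities(x, weights, means, variances)
posteriors = combined_symbol_posteriors(resp, stream_choice, stream_posteriors)
print(posteriors)
```

Because the toy frame lies close to the first Gaussian, the combined posteriors are dominated by the stream assigned to that Gaussian. In a real system the per-stream posteriors would come from classifiers trained on each feature type, which this sketch simply takes as given.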
