Speech recognition using fundamental frequency and voicing in acoustic modeling

Prosody has long been studied as a knowledge source in speech processing. We attempt to exploit prosodic correlates directly in the acoustic modeling of speech for large-vocabulary recognition, and we compare two methods for using the fundamental frequency and voicing parameters. The more complex approach first models prosodic classes and then uses a representation of their recognized sequences as acoustic features. The simpler approach appends suitably normalized raw values to the conventional mel cepstral coefficients in the observation vectors. The simpler approach achieves modest accuracy gains on the HUB-5 Eval-2001 test set.
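
As a rough illustration of the simpler approach, the sketch below appends a normalized fundamental frequency value and a voicing value to each frame's cepstral vector. The abstract does not say which normalization is used, so everything specific here is an assumption made for illustration: the per-utterance z-normalization of log-F0 over voiced frames, the 0.0 placeholder for unvoiced frames, the 0.5 voicing threshold, and the function names `normalize_log_f0` and `append_prosodic_features`.

```python
import math
from statistics import mean, pstdev


def normalize_log_f0(f0_hz, voiced):
    """Z-normalize log-F0 over voiced frames; unvoiced frames map to 0.0.

    Assumption: per-utterance mean/variance normalization in the log domain.
    The source only says the raw values are "suitably normalized".
    """
    log_f0 = [math.log(f) for f, v in zip(f0_hz, voiced) if v and f > 0]
    if not log_f0:
        return [0.0] * len(f0_hz)
    mu = mean(log_f0)
    sigma = pstdev(log_f0) or 1.0  # guard against zero spread
    return [
        (math.log(f) - mu) / sigma if v and f > 0 else 0.0
        for f, v in zip(f0_hz, voiced)
    ]


def append_prosodic_features(cepstra, f0_hz, voicing, threshold=0.5):
    """Return observation vectors: cepstra + [normalized log-F0, voicing].

    cepstra: one list of mel cepstral coefficients per frame
    f0_hz:   one F0 estimate per frame (0.0 where no pitch was found)
    voicing: one voicing value in [0, 1] per frame (hypothetical scale)
    """
    if not (len(cepstra) == len(f0_hz) == len(voicing)):
        raise ValueError("all inputs must have one entry per frame")
    voiced = [p >= threshold for p in voicing]
    norm_f0 = normalize_log_f0(f0_hz, voiced)
    return [list(c) + [z, p] for c, z, p in zip(cepstra, norm_f0, voicing)]


if __name__ == "__main__":
    frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy 2-dimensional cepstra
    observations = append_prosodic_features(
        frames, f0_hz=[100.0, 0.0, 200.0], voicing=[0.9, 0.1, 0.8]
    )
    for obs in observations:
        print(obs)
```

Each output vector is two dimensions longer than the input cepstral vector, so an acoustic model trained on these observations would need its feature dimensionality adjusted accordingly.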
