N-Best Rescoring Based on Pitch-accent Patterns

In this paper, we adopt an n-best rescoring scheme that uses pitch-accent patterns to improve automatic speech recognition (ASR) performance. The pitch-accent model is decoupled from the main ASR system, which allows us to develop it independently. N-best hypotheses from the recognizer are rescored with additional scores that measure how well the pitch-accent patterns in the acoustic signal correlate with those suggested by lexical cues. To test the robustness of our algorithm, we use two data sets and recognition setups: the first is English radio news data with pitch-accent labels, where the recognizer is trained on a small amount of data and has a high error rate; the second is English broadcast news data recognized with a state-of-the-art SRI recognizer. Our experimental results show a relative word error rate reduction of about 3%. This gain is consistent across the two tests, which suggests that incorporating prosodic information is a promising direction for improving speech recognition.
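The rescoring step described above can be illustrated with a minimal sketch. It assumes each hypothesis already carries a recognizer log score and a single pitch-accent agreement log score, and that the two are combined linearly with a tunable weight. The names `Hypothesis`, `accent_score`, `rescore`, and `weight`, as well as the toy scores, are hypothetical and chosen only for illustration; the abstract does not specify how the scores are actually computed or combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    words: List[str]
    asr_score: float     # log score from the recognizer (assumed given)
    accent_score: float  # hypothetical log score: agreement between acoustic
                         # and lexical pitch-accent patterns (assumed given)


def rescore(nbest: List[Hypothesis], weight: float) -> List[Hypothesis]:
    """Return the hypotheses sorted best-first by a linear score combination.

    The linear form asr_score + weight * accent_score is an assumption of
    this sketch, not necessarily the combination used in the paper.
    """
    return sorted(
        nbest,
        key=lambda h: h.asr_score + weight * h.accent_score,
        reverse=True,
    )


# Toy n-best list with made-up scores.
nbest = [
    Hypothesis(["the", "record", "was", "broken"], asr_score=-10.0, accent_score=-3.0),
    Hypothesis(["the", "wreck", "cord", "was", "broken"], asr_score=-9.5, accent_score=-6.0),
]

# With weight 0 the recognizer's own ranking is kept; a positive weight lets
# the pitch-accent score change which hypothesis ranks first.
baseline_best = rescore(nbest, weight=0.0)[0]
rescored_best = rescore(nbest, weight=0.5)[0]
print(" ".join(baseline_best.words))
print(" ".join(rescored_best.words))
```

Because the pitch-accent score enters only at this stage, the model that produces it can be trained and tuned without touching the recognizer, which is the decoupling the abstract refers to. In practice the weight would be tuned on held-out data.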
