Paper information - N-Best Rescoring Based on Pitch-accent Patterns

N-Best Rescoring Based on Pitch-accent Patterns

In this paper, we adopt an n-best rescoring scheme that uses pitch-accent patterns to improve automatic speech recognition (ASR) performance. The pitch-accent model is decoupled from the main ASR system, which allows us to develop it independently. N-best hypotheses from the recognizer are rescored with additional scores that measure the correlation of the pitch-accent patterns between the acoustic signal and lexical cues. To test the robustness of our algorithm, we use two different data sets and recognition setups: the first is English radio news data that has pitch-accent labels, where the recognizer is trained on a small amount of data and has a high error rate; the second is English broadcast news data recognized with a state-of-the-art SRI recognizer. Our experimental results demonstrate that our approach yields a relative word error rate reduction of about 3%. This gain is consistent across the two tests, which suggests that incorporating prosodic information is a promising direction for improving speech recognition.
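The rescoring idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the scoring formula, so the `Hypothesis` fields, the agreement measure in `accent_agreement`, the interpolation `weight`, and all numbers below are hypothetical. The sketch assumes each hypothesis carries per-syllable pitch-accent probabilities from an acoustic model and from a lexical model, and adds a weighted agreement term to the recognizer score before re-ranking.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    words: List[str]
    asr_score: float              # recognizer log score (acoustic + language model)
    acoustic_accent: List[float]  # assumed P(accent | acoustics), one value per syllable
    lexical_accent: List[float]   # assumed P(accent | lexical cues), one value per syllable


def accent_agreement(h: Hypothesis) -> float:
    """Mean log-probability that both models pick the same accent label per syllable.

    This agreement measure is an assumption made for illustration only.
    """
    pairs = list(zip(h.acoustic_accent, h.lexical_accent))
    if not pairs:
        return 0.0
    total = 0.0
    for p_ac, p_lex in pairs:
        agree = p_ac * p_lex + (1.0 - p_ac) * (1.0 - p_lex)
        total += math.log(max(agree, 1e-10))  # floor avoids log(0)
    return total / len(pairs)


def rescore(nbest: List[Hypothesis], weight: float) -> List[Hypothesis]:
    """Re-rank the n-best list by recognizer score plus weighted accent agreement."""
    return sorted(
        nbest,
        key=lambda h: h.asr_score + weight * accent_agreement(h),
        reverse=True,
    )


def relative_wer_reduction(baseline_wer: float, rescored_wer: float) -> float:
    """Relative reduction, e.g. 0.03 means the WER dropped by 3% of its baseline value."""
    return (baseline_wer - rescored_wer) / baseline_wer


# Made-up two-best list: the second hypothesis has a slightly worse recognizer
# score, but its lexical accent pattern matches the acoustic evidence better.
nbest = [
    Hypothesis(["a", "record"], -10.0, [0.9, 0.1], [0.2, 0.8]),
    Hypothesis(["a", "wreck", "heard"], -10.5, [0.9, 0.1], [0.9, 0.1]),
]
best_without_prosody = rescore(nbest, weight=0.0)[0]
best_with_prosody = rescore(nbest, weight=1.0)[0]
print(best_without_prosody.words, best_with_prosody.words)

# Hypothetical WER values chosen only to show the arithmetic of a 3% relative gain.
print(round(relative_wer_reduction(20.0, 19.4), 3))
```

Because the pitch-accent scores enter only at the re-ranking stage, the accent models can be trained and tuned separately from the recognizer, which is the decoupling the abstract highlights. In practice the weight would be tuned on held-out data.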
