Prosody boundary detection through context-dependent position models

In this paper, we propose to convert the prosody boundary detection task into a syllable position labeling task. To detect both prosodic word and prosodic phrase boundaries, six types of syllable positions are defined. For each position, context-dependent position models are trained from manually labeled data. These models are then used to label syllable positions in unseen speech, and word and phrase boundaries are derived directly from the syllable position labels. The proposed approach is tested on a large-scale single-speaker database. Precision and recall are 96.1% and 90.1%, respectively, for word boundaries, and 83.7% and 80.5%, respectively, for phrase boundaries. The results of a listening test show that only 28% of the errors in automatically detected word boundaries and 50% of the errors in automatically detected phrase boundaries are critical, implying critical error rates of only about 2.2% and 10% for word and phrase boundaries, respectively. These results are encouraging, especially considering that only acoustic features are used in this work.
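As a rough illustration of how boundaries can be read off syllable position labels and then scored, the following minimal sketch may help. The abstract does not list the six position types, so the label names below (`PI`, `PF`, `WI`, `WM`, `WF`, `MONO`) and both helper functions are hypothetical stand-ins, not the inventory or procedure used in the paper.

```python
# Hypothetical six-label inventory; the abstract does not name the actual labels.
#   PI   phrase-initial syllable      PF   phrase-final syllable
#   WI   word-initial syllable        WF   word-final syllable
#   WM   word-medial syllable         MONO monosyllabic prosodic word
WORD_START = {"PI", "WI", "MONO"}  # labels that open a prosodic word
WORD_END = {"PF", "WF", "MONO"}    # labels that close a prosodic word


def derive_boundaries(labels):
    """Return (word_boundaries, phrase_boundaries) as sets of indices i,
    where i means a boundary falls between syllable i and syllable i + 1."""
    word, phrase = set(), set()
    for i in range(len(labels) - 1):
        cur, nxt = labels[i], labels[i + 1]
        if cur == "PF" or nxt == "PI":
            phrase.add(i)
        if cur in WORD_END or nxt in WORD_START:
            word.add(i)
    # Assumption: every phrase boundary is also a word boundary.
    return word | phrase, phrase


def precision_recall(detected, reference):
    """Precision and recall of detected boundary indices against a reference set."""
    hits = len(detected & reference)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Usage with made-up labels for a seven-syllable utterance.
labels = ["PI", "WF", "WI", "WM", "PF", "PI", "PF"]
word_b, phrase_b = derive_boundaries(labels)
word_p, word_r = precision_recall(word_b, {1, 3, 4})
```

In this sketch a boundary is accepted if either neighboring syllable signals it, which is one possible way to tolerate a single mislabeled syllable; the paper may resolve conflicting labels differently.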
