Joint Detection of Sentence Stress and Phrase Boundary for Prosody

Prosodic event detection plays an important role in spoken language processing tasks and Computer-Assisted Pronunciation Training (CAPT) systems [1]. Traditional methods for detecting sentence stress and phrase boundaries rely on machine learning models that capture limited contextual information and largely ignore the interaction between these two prosodic events. In this paper, we propose a hierarchical network that models contextual factors at the phoneme, syllable, and word levels using bidirectional Long Short-Term Memory (BLSTM). Moreover, to account for the inherent connection between sentence stress and phrase boundaries, we model these two prosodic events jointly within a multi-task learning (MTL) framework that shares common prosodic features. We evaluate the network on the Aix-Machine Readable Spoken English Corpus (Aix-MARSEC). Experimental results show that the proposed method achieves an F1-measure of 90% for sentence stress detection and 91% for phrase boundary detection, outperforming a conditional random field (CRF) baseline by about 4% and 9%, respectively.
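As an illustration of the reported metric, the following is a minimal sketch of how an F1-measure for one binary prosodic event could be computed from aligned label sequences. The function name `f1_measure`, the choice of one label per unit, and the toy label sequences are assumptions made for illustration; the abstract does not state the unit at which the scores are computed, and none of the values below come from the paper.

```python
def f1_measure(reference, predicted, positive=1):
    """Return the F1-measure of the `positive` class for two aligned label sequences."""
    if len(reference) != len(predicted):
        raise ValueError("sequences must be aligned and equal in length")

    pairs = list(zip(reference, predicted))
    true_pos = sum(1 for r, p in pairs if r == positive and p == positive)
    false_pos = sum(1 for r, p in pairs if r != positive and p == positive)
    false_neg = sum(1 for r, p in pairs if r == positive and p != positive)

    # With no true positives, precision and recall are zero or undefined;
    # this sketch reports 0.0 in that case.
    if true_pos == 0:
        return 0.0

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 1 marks a stressed unit or a unit followed by a boundary.
stress_ref = [1, 0, 0, 1, 0, 1, 0, 0]
stress_pred = [1, 0, 1, 1, 0, 0, 0, 0]
boundary_ref = [0, 0, 0, 1, 0, 0, 0, 1]
boundary_pred = [0, 0, 0, 1, 0, 0, 0, 1]

stress_f1 = f1_measure(stress_ref, stress_pred)
boundary_f1 = f1_measure(boundary_ref, boundary_pred)
```

In a joint setting, each task would be scored separately in this way, so one score is reported for sentence stress and one for phrase boundaries.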

[1]  Ngoc Thang Vu, et al. Prosodic Event Recognition Using Convolutional Neural Networks with Context Information, 2017, INTERSPEECH.

[2]  Cyril Auran, et al. Aix-MARSEC database, 2008.

[3]  Yu Zhang, et al. A Survey on Multi-Task Learning, 2017, IEEE Transactions on Knowledge and Data Engineering.

[4]  G. Ayers, et al. Guidelines for ToBI labelling, 1994.

[5]  Xu Li, et al. Automatic lexical stress and pitch accent detection for L2 English speech using multi-distribution deep neural networks, 2018, Speech Commun.

[6]  Yang Liu, et al. Automatic prosodic events detection using syllable-based acoustic and syntactic features, 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7]  Gary Geunbae Lee, et al. Automatic sentence stress feedback for non-native English learners, 2017, Comput. Speech Lang.

[8]  Louis Goldstein, et al. The coordination of boundary tones and its interaction with prominence, 2014, J. Phonetics.

[9]  Bhuvana Ramabhadran, et al. Modeling phrasing and prominence using deep recurrent learning, 2015, INTERSPEECH.

[10]  Hongyan Li, et al. Automatic Pitch Accent Detection Using Long Short-Term Memory Neural Networks, 2019, SSPS 2019.

[11]  Daniel Povey, et al. The Kaldi Speech Recognition Toolkit, 2011.

[12]  Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[13]  Gina-Anne Levow, et al. Context in multi-lingual tone and pitch accent recognition, 2005, INTERSPEECH.

[14]  Yasemin Altun, et al. Using Conditional Random Fields to Predict Pitch Accents in Conversational Speech, 2004, ACL.

[15]  Antje Schweitzer, et al. Experiments on automatic prosodic labeling, 2009, INTERSPEECH.

[16]  Eric Atwell, et al. An approach for detecting prosodic phrase boundaries in spoken English, 2007, ACM Crossroads.

[17]  T. F. Mitchell. David Abercrombie, Elements of General Phonetics. Edinburgh: Edinburgh University Press, 1966. Pp. 203, 1969.

[18]  Hua Yuan, et al. Exploiting contextual information for prosodic event detection using auto-context, 2013, EURASIP J. Audio Speech Music. Process.

[19]  Bayya Yegnanarayana, et al. Extraction and representation of prosodic features for language and speaker recognition, 2008, Speech Commun.

[20]  Zhizheng Wu, et al. Automatic prosody prediction and detection with Conditional Random Field (CRF) models, 2010, 2010 7th International Symposium on Chinese Spoken Language Processing.

[21]  Xu Li, et al. Phoneme Embedding and its Application to Speech Driven Talking Avatar Synthesis, 2016, INTERSPEECH.

[22]  Abeer Alwan, et al. Effects of intonational phrase boundaries on pitch-accented syllables in American English, 2008, INTERSPEECH.

[23]  Dwight L. Bolinger, et al. Intonation and Its Uses: Melody in Grammar and Discourse, 1989.

[24]  Julia Hirschberg, et al. Detecting Pitch Accents at the Word, Syllable and Vowel Level, 2009, NAACL.

[25]  Bhuvana Ramabhadran, et al. Discriminative training and unsupervised adaptation for labeling prosodic events with limited training data, 2010, INTERSPEECH.

[26]  Wolfgang Wokurek, et al. Pitch accent classification of fundamental frequency contours by hidden Markov models, 1995, EUROSPEECH.

[27]  Shuang Zhang, et al. Detection of intonation in L2 English speech of native Mandarin learners, 2010, 2010 7th International Symposium on Chinese Spoken Language Processing.

[28]  Andrew Rosenberg, et al. Automatic detection and classification of prosodic events, 2009.