Modeling phrasing and prominence using deep recurrent learning

Systems that predict prosodic events, such as pitch accents and phrasal boundaries, often rely on machine learning models that represent context by combining input features aggregated over a finite, and usually short, window of observations. Dynamic models go a step further by explicitly modeling the state sequence, but even then, many practical implementations are limited to a low-order finite-state machine. This Markovian assumption, however, does not properly address the interaction between short- and long-term contextual factors that is known to affect the realization and placement of these prosodic events. Bidirectional Recurrent Neural Networks (BiRNNs) are a class of models that overcome this limitation in two ways: they predict the outputs as a function of a state variable that accumulates information over the entire input sequence, and they stack several layers to form a deep architecture able to extract more structure from the input features. These models have already demonstrated state-of-the-art performance on some prosodic regression tasks. In this work we examine a new application of BiRNNs to the task of classifying categorical prosodic events, and demonstrate that they outperform baseline systems.
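To make the architecture described above concrete, the following is a minimal sketch of a stacked bidirectional recurrent network that emits one class distribution per time step. It is an illustration under stated assumptions, not the system evaluated in this work: the plain tanh recurrence, the random untrained weights, the layer sizes, and all function names (`rnn_pass`, `birnn_layer`, `classify_sequence`) are hypothetical choices made for the example.

```python
import math
import random

def make_matrix(rows, cols, rng):
    # Random weights stand in for trained parameters (assumption).
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def rnn_pass(inputs, w_in, w_rec, reverse=False):
    """Run a tanh recurrence over the sequence; return one state per step."""
    state = [0.0] * len(w_rec)
    steps = range(len(inputs))
    order = reversed(steps) if reverse else steps
    states = [None] * len(inputs)
    for t in order:
        drive = matvec(w_in, inputs[t])
        memory = matvec(w_rec, state)
        # The state accumulates information from every step seen so far.
        state = [math.tanh(d + m) for d, m in zip(drive, memory)]
        states[t] = state
    return states

def birnn_layer(inputs, layer):
    """Concatenate left-to-right and right-to-left states at each step."""
    fwd = rnn_pass(inputs, layer["w_in_f"], layer["w_rec_f"])
    bwd = rnn_pass(inputs, layer["w_in_b"], layer["w_rec_b"], reverse=True)
    return [f + b for f, b in zip(fwd, bwd)]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def init_params(input_dim, hidden, n_layers, n_classes, seed=0):
    rng = random.Random(seed)
    layers, dim = [], input_dim
    for _ in range(n_layers):
        layers.append({
            "w_in_f": make_matrix(hidden, dim, rng),
            "w_rec_f": make_matrix(hidden, hidden, rng),
            "w_in_b": make_matrix(hidden, dim, rng),
            "w_rec_b": make_matrix(hidden, hidden, rng),
        })
        dim = 2 * hidden  # the next layer sees both directions
    return {"layers": layers, "w_out": make_matrix(n_classes, dim, rng)}

def classify_sequence(inputs, params):
    """Return per-step class probabilities (e.g. accent vs. no accent)."""
    hidden = inputs
    for layer in params["layers"]:
        hidden = birnn_layer(hidden, layer)
    return [softmax(matvec(params["w_out"], h)) for h in hidden]

# Toy input: 6 time steps with 4 made-up acoustic features each.
data_rng = random.Random(1)
features = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
params = init_params(input_dim=4, hidden=5, n_layers=2, n_classes=2)
probs = classify_sequence(features, params)

# Perturbing only the final step also shifts the prediction at the first step,
# because the backward pass carries that information across the sequence.
perturbed = [row[:] for row in features]
perturbed[-1] = [x + 1.0 for x in perturbed[-1]]
probs_perturbed = classify_sequence(perturbed, params)
```

Because each output depends on states computed in both directions, a change at any position can influence the prediction at every other position, which is the property that a low-order Markov model lacks. A practical system would typically replace the tanh recurrence with a gated unit and train the weights on labeled data.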
