Paper information - Modeling phrasing and prominence using deep recurrent learning

Modeling phrasing and prominence using deep recurrent learning

Systems that predict prosodic events, such as pitch accents and phrasal boundaries, often rely on machine learning models that combine a set of input features aggregated over a finite, and usually short, number of observations to model context. Dynamic models go a step further by explicitly modeling the state sequence, but even then, many practical implementations are limited to a low-order finite-state machine. This Markovian assumption, however, does not properly address the interaction between short- and long-term contextual factors that is known to affect the realization and placement of these prosodic events. Bidirectional Recurrent Neural Networks (BiRNNs) are a class of models that overcome this limitation in two ways: they predict the outputs as a function of a state variable that accumulates information over the entire input sequence, and they stack several layers to form a deep architecture able to extract more structure from the input features. These models have already demonstrated state-of-the-art performance on some prosodic regression tasks. In this work we examine a new application of BiRNNs to the task of classifying categorical prosodic events, and demonstrate that they outperform baseline systems.
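To make the BiRNN idea in the abstract concrete, the following is a minimal sketch of a stacked bidirectional recurrent classifier in plain Python. It is not the authors' network: the plain tanh recurrence, the random untrained weights, the layer sizes, and every name in it (`rnn_pass`, `birnn_layer`, `deep_birnn_classify`, `init_layer`) are assumptions chosen for illustration. It shows only the structural point: each word's output depends on a forward state and a backward state, so it can draw on the whole sequence, and layers can be stacked.

```python
import math
import random


def rnn_pass(inputs, w_in, w_rec, reverse=False):
    """Run a simple tanh RNN over a sequence; return one hidden state per step."""
    hidden_size = len(w_rec)
    state = [0.0] * hidden_size
    steps = range(len(inputs))
    order = reversed(steps) if reverse else steps
    states = [None] * len(inputs)
    for t in order:
        x = inputs[t]
        state = [
            math.tanh(
                sum(w_in[j][i] * x[i] for i in range(len(x)))
                + sum(w_rec[j][k] * state[k] for k in range(hidden_size))
            )
            for j in range(hidden_size)
        ]
        states[t] = state
    return states


def birnn_layer(inputs, params):
    """Concatenate forward and backward states at every time step."""
    fwd = rnn_pass(inputs, params["w_in_f"], params["w_rec_f"])
    bwd = rnn_pass(inputs, params["w_in_b"], params["w_rec_b"], reverse=True)
    return [f + b for f, b in zip(fwd, bwd)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deep_birnn_classify(inputs, layers, w_out):
    """Stack bidirectional layers, then emit class posteriors per step."""
    hidden = inputs
    for params in layers:
        hidden = birnn_layer(hidden, params)
    return [
        softmax([sum(w * h for w, h in zip(row, step)) for row in w_out])
        for step in hidden
    ]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def init_layer(input_size, hidden_size, rng):
    """Random (untrained) weights for one bidirectional layer."""
    return {
        "w_in_f": random_matrix(hidden_size, input_size, rng),
        "w_rec_f": random_matrix(hidden_size, hidden_size, rng),
        "w_in_b": random_matrix(hidden_size, input_size, rng),
        "w_rec_b": random_matrix(hidden_size, hidden_size, rng),
    }


# Hypothetical sizes: 3 features per word, 4 hidden units per direction,
# 2 classes (for example, accented vs. unaccented).
FEATURES, HIDDEN, CLASSES = 3, 4, 2
rng = random.Random(0)
layers = [init_layer(FEATURES, HIDDEN, rng), init_layer(2 * HIDDEN, HIDDEN, rng)]
w_out = random_matrix(CLASSES, 2 * HIDDEN, rng)

# A made-up five-word utterance with random feature vectors.
words = [[rng.uniform(-1.0, 1.0) for _ in range(FEATURES)] for _ in range(5)]
posteriors = deep_birnn_classify(words, layers, w_out)

# Changing only the last word also changes the prediction for the first word,
# because the backward pass carries information from the end of the sequence.
altered = words[:-1] + [[5.0, 5.0, 5.0]]
altered_posteriors = deep_birnn_classify(altered, layers, w_out)
```

In this sketch `posteriors` holds one probability distribution per word. A low-order Markov model would not let the final word influence the first word's prediction directly, which is the limitation the abstract attributes to such models.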

Bhuvana Ramabhadran | Andrew Rosenberg | Raul Fernandez
