Phrase Break Prediction for Long-Form Reading TTS: Exploiting Text Structure Information

Phrasing structure is one of the most important factors in increasing the naturalness of text-to-speech (TTS) systems, particularly for long-form reading. Most existing TTS systems are optimized for isolated short sentences and discard the larger context and structure of the text. This paper describes how we built phrasing models from data extracted from audiobooks. We investigate how several types of textual features can improve phrase break prediction: part-of-speech (POS) tags, guess POS (GPOS), dependency tree features, and word embeddings. These features are fed into either a bidirectional LSTM (BiLSTM) or a CART baseline. The resulting systems are compared using both objective and subjective evaluations. Using a BiLSTM and word embeddings proves beneficial.
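As an illustration of what an objective evaluation of phrase break prediction might look like, the sketch below treats the task as per-word binary labeling (1 = break after the word, 0 = no break) and scores the break class with precision, recall, and F1. The abstract does not state which objective metrics were used, so the metric choice, the function name `break_prf`, and the toy sentence are all assumptions made for illustration only.

```python
from typing import Sequence, Tuple


def break_prf(reference: Sequence[int], predicted: Sequence[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the break class (label 1).

    Hypothetical helper: the labeling scheme and metric are assumptions,
    not taken from the paper.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # correctly predicted breaks
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # spurious breaks
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # missed breaks

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example (invented data): one label per word, 1 = break after that word.
words = ["When", "the", "rain", "stopped", "we", "walked", "home"]
reference = [0, 0, 0, 1, 0, 0, 1]  # breaks after "stopped" and "home"
predicted = [0, 0, 1, 1, 0, 0, 1]  # one spurious break after "rain"

precision, recall, f1 = break_prf(reference, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Scoring only the break class is a deliberate choice in this sketch: non-breaks dominate running text, so plain accuracy would look high even for a model that never predicts a break.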
