GraphSpeech: Syntax-Aware Graph Attention Network for Neural Speech Synthesis

Attention-based end-to-end text-to-speech synthesis (TTS) is superior to conventional statistical methods in many ways, and Transformer-based TTS is one such successful implementation. While Transformer TTS models the speech frame sequence well with a self-attention mechanism, it does not associate the input text with the output utterance from a syntactic point of view at the sentence level. We propose a novel neural TTS model, denoted GraphSpeech, that is formulated under the graph neural network framework. GraphSpeech explicitly encodes the syntactic relations among the input lexical tokens of a sentence and uses this information to derive syntactically motivated character embeddings for the TTS attention mechanism. Experiments show that GraphSpeech consistently outperforms the Transformer TTS baseline in terms of spectrum and prosody rendering of utterances.
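
To make the idea of syntax-aware attention concrete, the following is a minimal sketch, not the GraphSpeech encoder itself. It assumes a dependency parse is already available as a list of head indices, builds a graph with forward, reverse, and self-loop edges, and restricts plain dot-product self-attention to those edges. The function names, the word-level (rather than character-level) granularity, the hard mask, and the toy sentence are all assumptions made for illustration; the abstract does not specify how the syntactic relations enter the attention computation.

```python
import numpy as np


def dependency_adjacency(heads):
    """Build a boolean adjacency matrix from dependency heads.

    heads[i] is the index of token i's syntactic head, or -1 for the root.
    Edges are added in both directions, plus a self-loop for every token.
    (Hypothetical helper, not taken from the paper.)
    """
    n = len(heads)
    adj = np.eye(n, dtype=bool)              # self-loops
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child, head] = True          # child -> head
            adj[head, child] = True          # head -> child (reverse edge)
    return adj


def syntax_masked_attention(x, adj):
    """Single-head dot-product self-attention limited to graph edges.

    x:   (n, d) array of token embeddings.
    adj: (n, n) boolean adjacency matrix.
    Returns the attended embeddings and the attention weights.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(adj, scores, -1e9)     # block non-neighbours
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights


if __name__ == "__main__":
    # Toy parse for a three-token sentence whose second token is the root.
    heads = [1, -1, 1]
    adj = dependency_adjacency(heads)
    x = np.random.default_rng(0).normal(size=(len(heads), 4))
    out, weights = syntax_masked_attention(x, adj)
    print(adj.astype(int))
    print(weights.round(3))
```

In this sketch the first and third tokens never attend to each other directly because no dependency edge links them. A softer alternative would add a learned bias per relation type instead of masking.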
