Generating the Voice of the Interactive Virtual Assistant

This chapter presents an overview of current approaches to generating spoken content with text-to-speech (TTS) synthesis systems, and thus the voice of an Interactive Virtual Assistant (IVA). The overview starts from the issues that make spoken content generation a non-trivial task, and introduces the two main components of a TTS system: text processing and acoustic modelling. It then provides the reader with the minimum scientific detail on the terminology and methods involved in speech synthesis, yet with sufficient depth to support the initial decisions regarding the choice of technology for the IVA's vocal identity. The description of speech synthesis methodologies begins with basic, easy-to-run, low-requirement rule-based synthesis and ends in the state-of-the-art deep learning landscape. To bring this extremely complex and extensive research field closer to commercial deployment, an extensive index of the readily and freely available resources and tools required to build a TTS system is provided. Quality evaluation methods and open research problems are also highlighted at the end of the chapter.
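To make the two-component structure concrete, the sketch below outlines a minimal TTS pipeline in Python. It is purely illustrative and assumes nothing about any particular toolkit: `normalize`, `grapheme_to_phoneme`, and `AcousticModel` are hypothetical names standing in for a full text-processing front-end and a trained acoustic model with a vocoder.

```python
import re

# --- Text-processing front-end (hypothetical, rule-based sketch) ---

NUMBER_WORDS = {"1": "one", "2": "two", "3": "three"}  # toy normalization table

def normalize(text: str) -> str:
    """Expand a few non-standard words (here: lone digits) into spoken form."""
    tokens = text.lower().split()
    return " ".join(NUMBER_WORDS.get(t, t) for t in tokens)

def grapheme_to_phoneme(word: str) -> list[str]:
    """Toy letter-to-sound rule: one pseudo-phoneme per letter.

    A real system would use a pronunciation lexicon plus a trained G2P model.
    """
    return list(re.sub(r"[^a-z]", "", word))

# --- Acoustic back-end (placeholder) ---

class AcousticModel:
    """Stand-in for a trained model mapping phonemes to waveform samples."""

    def synthesize(self, phonemes: list[str]) -> list[float]:
        # A real model would predict acoustic features and run a vocoder;
        # here we emit silent frames only to show the interface.
        return [0.0] * (80 * len(phonemes))  # e.g. 80 samples per phoneme

def tts(text: str, model: AcousticModel) -> list[float]:
    """Full pipeline: normalize -> phonemize -> generate waveform samples."""
    phonemes: list[str] = []
    for word in normalize(text).split():
        phonemes.extend(grapheme_to_phoneme(word))
    return model.synthesize(phonemes)

if __name__ == "__main__":
    samples = tts("Meet me at 3", AcousticModel())
    print(f"Generated {len(samples)} samples")
```

Keeping the front-end and back-end behind such narrow interfaces is what lets, for instance, a rule-based phonetizer be swapped for a neural G2P model, or an HMM-based acoustic model for a deep learning one, without touching the rest of the pipeline.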
