Speech Recognition with Augmented Synthesized Speech

The recent success of the Tacotron speech synthesis architecture and its variants in producing natural-sounding, multi-speaker synthesized speech has raised the exciting possibility of replacing the expensive, manually transcribed, domain-specific human speech used to train speech recognizers. The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker, and style variation derived from input acoustic representations, thereby allowing the synthesized speech to be manipulated. In this paper, we evaluate the feasibility of enhancing speech recognition performance with speech synthesis, using two corpora from different domains. We explore algorithms that provide the acoustic and lexical diversity needed for robust speech recognition. Finally, we demonstrate the feasibility of this approach as a data augmentation strategy for domain transfer. We find that improvements in speech recognition performance are achievable by augmenting training data with synthesized material; however, a substantial gap in performance remains between recognizers trained on human speech and those trained on synthesized speech.
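
The abstract does not spell out implementation details, but the core augmentation idea can be sketched as follows: synthesize utterances from text-only, in-domain data with a multi-speaker TTS model, sampling a different point in the speaker-embedding space for each utterance to add acoustic diversity, and mix the result with the human-recorded training set. The Python below is a minimal illustrative sketch under assumed interfaces; names such as `synthesize`, `sample_speaker_embedding`, and `build_augmented_training_set` are hypothetical placeholders, not the paper's actual code.

```python
# Sketch: mix real transcribed utterances with synthesized speech generated
# from text-only data, sampling a different speaker embedding per synthesized
# utterance for acoustic diversity. All names are illustrative placeholders.

import random
import numpy as np

EMBED_DIM = 256          # assumed dimensionality of the speaker-embedding space
SAMPLE_RATE = 16000      # assumed audio sample rate


def sample_speaker_embedding(rng: np.random.Generator) -> np.ndarray:
    """Draw a random point in the speaker-embedding space.

    A real system would sample from (or interpolate between) embeddings of
    training speakers; a standard normal draw stands in for that here.
    """
    return rng.standard_normal(EMBED_DIM).astype(np.float32)


def synthesize(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Placeholder for a multi-speaker TTS model (e.g. a Tacotron variant).

    Returns a fake waveform so the sketch runs end to end.
    """
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    duration = SAMPLE_RATE * max(1, len(text) // 10)
    return 0.01 * rng.standard_normal(duration).astype(np.float32)


def build_augmented_training_set(real_pairs, extra_texts, synth_ratio=1.0, seed=0):
    """Combine human-recorded (audio, transcript) pairs with synthesized ones.

    synth_ratio controls how many synthesized utterances are added relative
    to the number of real ones.
    """
    rng = np.random.default_rng(seed)
    n_synth = int(synth_ratio * len(real_pairs))
    texts = random.Random(seed).sample(extra_texts, min(n_synth, len(extra_texts)))

    synthetic_pairs = []
    for text in texts:
        spk = sample_speaker_embedding(rng)   # acoustic diversity
        audio = synthesize(text, spk)         # lexical diversity from text-only data
        synthetic_pairs.append((audio, text))

    return real_pairs + synthetic_pairs


if __name__ == "__main__":
    real = [(np.zeros(SAMPLE_RATE, dtype=np.float32), "hello world")]
    extra = ["speech recognition with augmented synthesized speech",
             "domain transfer via text to speech"]
    mixed = build_augmented_training_set(real, extra, synth_ratio=2.0)
    print(f"{len(mixed)} training utterances ({len(mixed) - len(real)} synthesized)")
```

The choice of how the speaker embedding is sampled (random draw, interpolation between seen speakers, or copying embeddings from reference audio) is the main lever for acoustic diversity in such a setup; the transcripts fed to the TTS model are the lever for lexical diversity and domain transfer.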
