论文信息 - Training Neural Speech Recognition Systems with Synthetic Speech Augmentation

Training Neural Speech Recognition Systems with Synthetic Speech Augmentation

Building an accurate automatic speech recognition (ASR) system requires a large dataset that contains many hours of labeled speech samples produced by a diverse set of speakers. The lack of such open free datasets is one of the main issues preventing advancements in ASR research. To address this problem, we propose to augment a natural speech dataset with synthetic speech. We train very large end-to-end neural speech recognition models using the LibriSpeech dataset augmented with synthetic speech. These new models achieve state of the art Word Error Rate (WER) for character-level based models without an external language model.

[1] Kyu J. Han,et al. The CAPIO 2017 Conversational Speech Recognition System , 2017, ArXiv.

[2] Chong Wang,et al. Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[3] Geoffrey E. Hinton,et al. A time-delay neural network architecture for isolated word recognition , 1990, Neural Networks.

[4] Sanjeev Khudanpur,et al. Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Hermann Ney,et al. Improved training of end-to-end attention models for speech recognition , 2018, INTERSPEECH.

[6] Gabriel Synnaeve,et al. Wav2Letter: an End-to-End ConvNet-based Speech Recognition System , 2016, ArXiv.

[7] Yuxuan Wang,et al. Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis , 2018, ICML.

[8] Erich Elsen,et al. Deep Speech: Scaling up end-to-end speech recognition , 2014, ArXiv.

[9] Yiming Wang,et al. Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks , 2018, INTERSPEECH.

[10] Navdeep Jaitly,et al. Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11] Rico Sennrich,et al. Improving Neural Machine Translation Models with Monolingual Data , 2015, ACL.

[12] Tara N. Sainath,et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[13] Heiga Zen,et al. WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[14] Boris Ginsburg,et al. OpenSeq2Seq: Extensible Toolkit for Distributed and Mixed Precision Training of Sequence-to-Sequence Models , 2018, ArXiv.