Using Personalized Speech Synthesis and Neural Language Generator for Rapid Speaker Adaptation

We propose using personalized speech synthesis and a neural language generator to synthesize content-relevant, personalized speech for rapid speaker adaptation. The approach has two distinct benefits. First, it alleviates the data-sparsity problem inherent in rapid adaptation by augmenting the adaptation set with synthesized personalized speech. Second, it circumvents the explicit labeling errors of unsupervised adaptation by converting it into pseudo-supervised adaptation: because each synthesized utterance is generated from known text, labeling error is rendered implicitly as less damaging speech distortion in the personalized synthesized speech. This yields a significant performance breakthrough in rapid unsupervised speaker adaptation. We apply the proposed methodology to a speaker adaptation task in a state-of-the-art speech transcription system. With 1 minute (min) of adaptation data, the proposed approach yields a 9.19% or 5.98% relative word error rate (WER) reduction for supervised and unsupervised adaptation, respectively, compared with the negligible gain obtained when adapting with only 1 min of original speech. With 10 min of adaptation data, it yields a 12.53% or 7.89% relative WER reduction, doubling the gain of the baseline adaptation. The proposed approach is particularly well suited to unsupervised adaptation.
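
To make the workflow concrete, below is a minimal sketch of the pipeline as described in the abstract: enroll the target speaker's voice, generate content-relevant text with the language generator, synthesize personalized speech from that text, and fine-tune the acoustic model on the pooled real and synthesized data. All interface and method names here (TextGenerator, PersonalizedTTS, AcousticModel, and their methods) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the pseudo-supervised adaptation loop; every
# interface below is an assumption made for illustration only.
from typing import List, Optional, Protocol, Tuple


class TextGenerator(Protocol):
    def sample(self) -> str: ...                        # neural language generator


class PersonalizedTTS(Protocol):
    def enroll(self, audio: List[object]) -> object: ...               # build speaker voice
    def synthesize(self, text: str, voice: object) -> object: ...      # personalized speech


class AcousticModel(Protocol):
    def decode(self, audio: object) -> str: ...                        # first-pass recognition
    def finetune(self, pairs: List[Tuple[object, str]]) -> None: ...   # adaptation step


def adapt(am: AcousticModel, lg: TextGenerator, tts: PersonalizedTTS,
          audio: List[object], transcripts: Optional[List[str]] = None,
          n_synth: int = 200) -> AcousticModel:
    """Adapt `am` to one speaker using real audio plus synthesized speech."""
    voice = tts.enroll(audio)                           # personalize the TTS voice

    # Content-relevant text from the language generator; the text itself is
    # the exact label for the synthesized audio, so no explicit label error.
    synth = [(tts.synthesize(t, voice), t)
             for t in (lg.sample() for _ in range(n_synth))]

    # Labels for the real audio: reference transcripts if supervised,
    # first-pass hypotheses if unsupervised (pseudo-supervised setup).
    labels = transcripts if transcripts is not None else [am.decode(a) for a in audio]

    am.finetune(list(zip(audio, labels)) + synth)       # pooled adaptation set
    return am
```

Because each synthesized utterance is paired with the exact text it was generated from, any residual mismatch surfaces as speech distortion rather than as a wrong label, which is the conversion from unsupervised to pseudo-supervised adaptation described above.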
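
For clarity, the relative WER reductions quoted above follow the standard definition (stated here for convenience; not a formula taken from the paper):

\[
\text{relative WER reduction} = \frac{\mathrm{WER}_{\text{unadapted}} - \mathrm{WER}_{\text{adapted}}}{\mathrm{WER}_{\text{unadapted}}} \times 100\%
\]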
