A Highly Adaptive Acoustic Model for Accurate Multi-dialect Speech Recognition

Despite the success of deep learning in speech recognition, multi-dialect speech recognition remains a difficult problem. Although dialect-specific acoustic models are known to perform well in general, they are hard to maintain when dialect-specific data is scarce and the number of dialects per language is large. A single unified acoustic model (AM) that generalizes well across many dialects has therefore been in demand. In this paper, we propose a novel acoustic modeling technique for accurate multi-dialect speech recognition with a single AM. The proposed AM is dynamically adapted based on both dialect information and its own internal representation, resulting in a highly adaptive AM that handles multiple dialects simultaneously. We also propose a simple but effective training method for dealing with unseen dialects. Experimental results on large-scale speech datasets show that the proposed AM outperforms all previous approaches, reducing word error rates (WERs) by 8.11% relative compared to a single all-dialects AM and by 7.31% relative compared to dialect-specific AMs.
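
To make the adaptation idea concrete, below is a minimal PyTorch sketch of one plausible form of such dialect-conditioned modulation, in the spirit of feature-wise transformations (FiLM) and dynamic normalization, on which this line of work builds. The module name, parameter names, and the choice to pool the hidden states into a summary vector are illustrative assumptions, not the paper's exact architecture: an affine scale and shift are generated from the dialect embedding together with a summary of the layer's own internal representation, and then applied to the hidden acoustic features.

import torch
import torch.nn as nn

class DialectConditionedLayer(nn.Module):
    """Hypothetical adaptation layer (not the paper's exact design):
    modulates hidden acoustic features with a scale/shift generated from
    (a) a dialect embedding and (b) a pooled summary of the layer's own
    internal representation."""

    def __init__(self, hidden_dim: int, dialect_embed_dim: int):
        super().__init__()
        # Maps [dialect embedding ; pooled hidden summary] -> (gamma, beta).
        self.film = nn.Linear(dialect_embed_dim + hidden_dim, 2 * hidden_dim)

    def forward(self, h: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, hidden_dim) hidden acoustic features
        # d: (batch, dialect_embed_dim) utterance-level dialect embedding
        summary = h.mean(dim=1)  # pool the internal representation over time
        gamma, beta = self.film(torch.cat([d, summary], dim=-1)).chunk(2, dim=-1)
        # Broadcast the per-utterance scale/shift over the time axis.
        return gamma.unsqueeze(1) * h + beta.unsqueeze(1)

# Example: 4 utterances, 100 frames, 256-dim features, 32-dim dialect embeddings.
layer = DialectConditionedLayer(hidden_dim=256, dialect_embed_dim=32)
h = torch.randn(4, 100, 256)
d = torch.randn(4, 32)
out = layer(h, d)  # shape: (4, 100, 256)

Because the modulation parameters depend on the layer's own activations as well as the dialect embedding, the same set of weights can specialize per utterance, which is one way a single AM can serve many dialects at once.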
