Multi-Dialect Speech Recognition with a Single Sequence-to-Sequence Model

Sequence-to-sequence models provide a simple and elegant solution for building speech recognition systems by folding the separate components of a conventional system, namely the acoustic (AM), pronunciation (PM), and language (LM) models, into a single neural network. In this work, we look at one such sequence-to-sequence model, listen, attend and spell (LAS) [21], and explore the possibility of training a single model to serve different English dialects, which simplifies building multi-dialect systems by removing the need for separate AM, PM, and LM per dialect. We show that simply pooling the data from all dialects into one LAS model falls behind the performance of a model fine-tuned on each dialect. We then incorporate dialect-specific information into the model in two ways: by modifying the training targets, inserting the dialect symbol at the end of the original grapheme sequence, and by feeding a 1-hot representation of the dialect into all layers of the model. Experimental results on seven English dialects show that the proposed system effectively models dialect variations within a single LAS model, outperforming LAS models trained individually on each of the seven dialects by 3.1 to 16.5% relative.
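To make the two conditioning schemes concrete, below is a minimal Python sketch, not the authors' implementation. The dialect codes, function names, and the choice to inject the 1-hot vector by concatenating it with each layer's input are illustrative assumptions; the abstract only specifies that a dialect symbol is appended to the grapheme targets and that a 1-hot dialect vector is fed to all layers.

import numpy as np

# Illustrative dialect inventory; these seven codes are assumptions,
# not the paper's actual dialect set.
DIALECTS = ["en-us", "en-gb", "en-au", "en-ca", "en-in", "en-ie", "en-za"]
DIALECT_TO_ID = {d: i for i, d in enumerate(DIALECTS)}

def append_dialect_token(graphemes, dialect):
    """Scheme 1: insert a dialect symbol at the end of the grapheme targets.

    The output vocabulary is augmented with one special token per dialect
    (e.g. '<en-gb>'), so the decoder is trained to emit the dialect
    identity after the transcript.
    """
    return graphemes + ["<%s>" % dialect]

def condition_layer_input(layer_input, dialect):
    """Scheme 2: feed a 1-hot dialect vector into a layer.

    The 1-hot encoding is tiled across time and concatenated with the
    layer's normal input (one assumed way of "feeding" it in); the same
    vector would be injected at every layer of the model.
    """
    time_steps = layer_input.shape[0]
    one_hot = np.zeros((time_steps, len(DIALECTS)), dtype=layer_input.dtype)
    one_hot[:, DIALECT_TO_ID[dialect]] = 1.0
    return np.concatenate([layer_input, one_hot], axis=-1)

# Usage: a British English utterance.
targets = append_dialect_token(list("hello"), "en-gb")
# -> ['h', 'e', 'l', 'l', 'o', '<en-gb>']
hidden = np.random.randn(10, 256).astype(np.float32)  # (time, features)
conditioned = condition_layer_input(hidden, "en-gb")
# -> shape (10, 263): 256 feature dims + 7 dialect dims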

[1] Georg Heigold et al., "Multilingual acoustic models using distributed deep neural networks," ICASSP, 2013.

[2] Hasim Sak et al., "Multi-accent speech recognition with hierarchical grapheme based models," ICASSP, 2017.

[3] Yifan Gong et al., "Multi-accent deep neural network acoustic model with accent-specific top layer using the KLD-regularized model adaptation," INTERSPEECH, 2014.

[4] Alex Graves, "Sequence Transduction with Recurrent Neural Networks," arXiv, 2012.

[5] Steve Renals et al., "Multilingual training of deep neural networks," ICASSP, 2013.

[6] John R. Hershey et al., "Language independent end-to-end architecture for joint language identification and speech recognition," ASRU, 2017.

[7] Hui Lin et al., "A study on multilingual acoustic modeling for large vocabulary ASR," ICASSP, 2009.

[8] Jasha Droppo et al., "Multi-task learning in deep neural networks for improved phoneme recognition," ICASSP, 2013.

[9] Navdeep Jaitly et al., "Towards End-to-End Speech Recognition with Recurrent Neural Networks," ICML, 2014.

[10] Vassilios Diakoloukas et al., "Development of dialect-specific speech recognizers using adaptation methods," ICASSP, 1997.

[11] Liang Lu et al., "On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition," ICASSP, 2016.

[12] Kai Yu et al., "Cluster Adaptive Training for Deep Neural Network Based Acoustic Model," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.

[13] Marc'Aurelio Ranzato et al., "Large Scale Distributed Deep Networks," NIPS, 2012.

[14] Sepp Hochreiter and Jürgen Schmidhuber, "Long Short-Term Memory," Neural Computation, 1997.

[15] Tara N. Sainath et al., "Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home," INTERSPEECH, 2017.

[16] Yoshua Bengio et al., "Attention-Based Models for Speech Recognition," NIPS, 2015.

[17] Ngoc Thang Vu et al., "Multilingual deep neural network based acoustic modeling for rapid language adaptation," ICASSP, 2014.

[18] Hui Lin et al., "Learning Methods in Multilingual Speech Recognition," NIPS, 2008.

[19] Hermann Ney et al., "Multilingual acoustic modeling using graphemes," INTERSPEECH, 2003.

[20] Lei Xie et al., "Attention-Based End-to-End Speech Recognition on Voice Search," ICASSP, 2018.

[21] William Chan et al., "Listen, Attend and Spell," arXiv, 2015.

[22] Yu Zhang et al., "Very deep convolutional networks for end-to-end speech recognition," ICASSP, 2017.

[23] Fadi Biadsy et al., "Google's cross-dialect Arabic voice search," ICASSP, 2012.

[24] Andrew W. Senior et al., "Fast and accurate recurrent neural network acoustic models for speech recognition," INTERSPEECH, 2015.

[25] Hui Jiang et al., "Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code," ICASSP, 2013.

[26] Tara N. Sainath et al., "Lower Frame Rate Neural Network Acoustic Models," INTERSPEECH, 2016.

[27] William J. Byrne et al., "Towards language independent acoustic modeling," ICASSP, 2000.

[28] Martín Abadi et al., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems," arXiv, 2016.

[29] Pedro J. Moreno et al., "Towards acoustic model unification across dialects," SLT, 2016.

[30] Martin Wattenberg et al., "Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation," TACL, 2016.

[31] Dirk Van Compernolle, "Recognizing speech of goats, wolves, sheep and ... non-natives," Speech Communication, 2001.

[32] Hynek Hermansky et al., "Cross-lingual and multi-stream posterior features for low resource LVCSR systems," INTERSPEECH, 2010.

[33] Tanja Schultz et al., "Towards universal speech recognition," ICMI, 2002.

[34] Pedro J. Moreno et al., "Multi-Dialectical Languages Effect on Speech Recognition: Too Much Choice Can Hurt," ICNLSP, 2015.