Occam’s Adaptation: A Comparison of Interpolation-of-Bases Adaptation Methods for Multi-Dialect Acoustic Modeling with LSTMs

Multidialectal languages can pose challenges for acoustic modeling. Past research has shown that, given a large training corpus but no explicit modeling of inter-dialect variability, training individual per-dialect models yields superior performance to that of a single model trained on the combined data [1, 2]. In this work, we were motivated by the idea that adaptation techniques allow a model to learn dialect-independent features while still leveraging the larger training corpora afforded by pooling data across dialects. Our goal was thus to create a single multidialect acoustic model that would rival the performance of the dialect-specific models. Working in the context of deep Long Short-Term Memory (LSTM) acoustic models trained on up to 40K hours of speech, we explored several methods for training and incorporating dialect-specific information into the model, including 12 variants of interpolation-of-bases techniques related to Cluster Adaptive Training (CAT) [31] and Factorized Hidden Layer (FHL) [9] techniques. We found that with our model topology and large training corpus, simply appending the dialect-specific information to the feature vector yielded a more accurate model than any of the more complex interpolation-of-bases techniques, while requiring lower model complexity and fewer parameters. This simple adaptation produced a single unified model for all dialects that, in most cases, outperformed the individual per-dialect models.
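To make the contrast concrete, here is a minimal sketch in Python/NumPy of the two families of approaches, under illustrative shapes and names (FEAT_DIM, NUM_DIALECTS, HIDDEN, bases) that are assumptions, not the paper's implementation. Part (a) is the simple feature-append adaptation, where a one-hot dialect vector is concatenated onto each acoustic frame before it enters the network; part (b) is an interpolation-of-bases layer in the spirit of CAT/FHL, where the layer weight matrix is a dialect-dependent combination of basis matrices.

import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, NUM_DIALECTS, HIDDEN = 80, 4, 16  # illustrative sizes, not the paper's

def one_hot(dialect_id: int, num_dialects: int = NUM_DIALECTS) -> np.ndarray:
    """Encode the dialect ID as a one-hot vector."""
    v = np.zeros(num_dialects)
    v[dialect_id] = 1.0
    return v

# (a) Feature append: the dialect one-hot is concatenated onto every frame,
# so the input layer grows by only NUM_DIALECTS dimensions.
def append_dialect(frame: np.ndarray, dialect_id: int) -> np.ndarray:
    return np.concatenate([frame, one_hot(dialect_id)])

# (b) Interpolation of bases (CAT/FHL-style): the layer weight matrix is a
# weighted sum of NUM_DIALECTS basis matrices. Here the interpolation weights
# are just the one-hot dialect vector; in actual CAT/FHL training they would
# be learned or derived from a dialect/speaker embedding.
bases = rng.standard_normal((NUM_DIALECTS, HIDDEN, FEAT_DIM))

def interpolated_layer(frame: np.ndarray, dialect_id: int) -> np.ndarray:
    weights = one_hot(dialect_id)             # interpolation coefficients
    W = np.tensordot(weights, bases, axes=1)  # -> (HIDDEN, FEAT_DIM)
    return np.tanh(W @ frame)

frame = rng.standard_normal(FEAT_DIM)
print(append_dialect(frame, dialect_id=2).shape)      # (84,) = FEAT_DIM + NUM_DIALECTS
print(interpolated_layer(frame, dialect_id=2).shape)  # (16,) = HIDDEN

Note how (a) adds only NUM_DIALECTS extra input weights per first-layer unit, while (b) multiplies the layer's parameter count by the number of bases; this is consistent with the abstract's observation that the simple append achieves its accuracy with fewer parameters and lower model complexity.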

[1] Fadi Biadsy et al., "Google's cross-dialect Arabic voice search," in Proc. ICASSP, 2012.

[2] Yanpeng Li et al., "Improving deep neural networks based multi-accent Mandarin speech recognition using i-vectors and accent-specific top layer," in Proc. INTERSPEECH, 2015.

[3] Kaisheng Yao et al., "KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition," in Proc. ICASSP, 2013.

[4] Vassilios Digalakis et al., "On the integration of dialect and speaker adaptation in a multi-dialect speech recognition system," in Proc. EUSIPCO, 1998.

[5] Tara N. Sainath et al., "Multi-dialect speech recognition with a single sequence-to-sequence model," in Proc. ICASSP, 2018.

[6] Martín Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint, 2016.

[7] Julius Kunze et al., "Transfer learning for speech recognition on a budget," in Proc. Rep4NLP@ACL, 2017.

[8] Pedro J. Moreno et al., "Multi-dialectical languages effect on speech recognition: Too much choice can hurt," in Proc. ICNLSP, 2015.

[9] Khe Chai Sim et al., "Factorized hidden layer adaptation for deep neural network based acoustic modeling," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.

[10] Y. Patel et al., "An integrated multi-dialect speech recognition system with optional speaker adaptation," in Proc. EUROSPEECH, 1995.

[11] Dirk Van Compernolle et al., "Speaker clustering for dialectic robustness in speaker independent recognition," in Proc. EUROSPEECH, 1991.

[12] Lyle Campbell, "Ethnologue: Languages of the World (review)," 2008.

[13] Andrew W. Senior et al., "Long Short-Term Memory based recurrent neural network architectures for large vocabulary speech recognition," arXiv preprint, 2014.

[14] Steve Renals et al., "Learning hidden unit contributions for unsupervised acoustic model adaptation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.

[15] Alex Graves et al., "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML, 2006.

[16] Jianwu Dang et al., "Exploring tonal information for Lhasa dialect acoustic modeling," in Proc. ISCSLP, 2016.

[17] Yifan Gong et al., "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers," in Proc. ICASSP, 2013.

[18] Albino Nogueiras et al., "Multidialectal Spanish acoustic modeling for speech recognition," Speech Communication, 2009.

[19] Joaquín González-Rodríguez et al., "Automatic language identification using deep neural networks," in Proc. ICASSP, 2014.

[20] Hui Jiang et al., "Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code," in Proc. ICASSP, 2013.

[21] Tara N. Sainath et al., "Lower frame rate neural network acoustic models," in Proc. INTERSPEECH, 2016.

[22] Ke Wang et al., "Empirical evaluation of speaker adaptation on DNN based acoustic model," in Proc. INTERSPEECH, 2018.

[23] Sepp Hochreiter and Jürgen Schmidhuber, "Long Short-Term Memory," Neural Computation, 1997.

[24] Ahmed Abdelali et al., "Spoken Arabic Algerian dialect identification," in Proc. ICNLSP, 2018.

[25] Khe Chai Sim et al., "Learning factorized transforms for unsupervised adaptation of LSTM-RNN acoustic models," in Proc. INTERSPEECH, 2017.

[26] Souvik Kundu et al., "Adaptation of deep neural network acoustic models for robust automatic speech recognition," in New Era for Robust Speech Recognition: Exploiting Deep Learning, 2017.

[27] Vassilios Diakoloukas et al., "Development of dialect-specific speech recognizers using adaptation methods," in Proc. ICASSP, 1997.

[28] Tara N. Sainath et al., "Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home," in Proc. INTERSPEECH, 2017.

[29] Andrew W. Senior et al., "Flat start training of CD-CTC-SMBR LSTM RNN acoustic models," in Proc. ICASSP, 2016.

[30] P. Lewis, "Ethnologue: Languages of the World," 2009.

[31] Kai Yu et al., "Cluster adaptive training for deep neural network based acoustic model," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.

[32] Khe Chai Sim et al., "Low-rank bases for factorized hidden layer adaptation of DNN acoustic models," in Proc. SLT, 2016.

[33] Tanja Schultz et al., "Comparison of acoustic model adaptation techniques on non-native speech," in Proc. ICASSP, 2003.

[34] Margo E. Wilson, "Arabic Speakers: Language and Culture, Here and Abroad," 1996.

[35] Yuqing Gao et al., "Speaker-independent upfront dialect adaptation in a large vocabulary continuous speech recognizer," in Proc. ICSLP, 1998.

[36] Pedro J. Moreno et al., "Towards acoustic model unification across dialects," in Proc. SLT, 2016.

[37] Yifan Gong et al., "Multi-accent deep neural network acoustic model with accent-specific top layer using the KLD-regularized model adaptation," in Proc. INTERSPEECH, 2014.

[38] Khe Chai Sim et al., "Learning effective factorized hidden layer bases using student-teacher training for LSTM acoustic model adaptation," in Proc. ICASSP, 2018.