论文信息 - Occam’s Adaptation: A Comparison of Interpolation of Bases Adaptation Methods for Multi-Dialect Acoustic Modeling with LSTMS

Occam’s Adaptation: A Comparison of Interpolation of Bases Adaptation Methods for Multi-Dialect Acoustic Modeling with LSTMS

Multidialectal languages can pose challenges for acoustic modeling. Past research has shown that with a large training corpus but without explicit modeling of inter-dialect variability, training individual per-dialect models yields superior performance to that of a single model trained on the combined data [1, 2]. In this work, we were motivated by the idea that adaptation techniques can allow the models to learn dialect-independent features and in turn leverage the power of the larger training corpus sizes afforded when pooling data across dialects. Our goal was thus to create a single multidialect acoustic model that would rival the performance of the dialect-specific models.Working in the context of deep Long-Short Term Memory (LSTM) acoustic models trained on up to 40K hours of speech, we explored several methods for training and incorporating dialect-specific information into the model, including 12 variants of interpolation-of-bases techniques related to Cluster Adaptive Training (CAT) [3] and Factorized Hidden Layer (FHL) [4] techniques. We found that with our model topology and large training corpus, simply appending the dialect-specific information to the feature vector resulted in a more accurate model than any of the more complex interpolation-of-bases techniques, while requiring less model complexity and fewer parameters. This simple adaptation yielded a single unified model for all dialects that, in most cases, outperformed individual models which had been trained per-dialect.

Eugene Weinstein | Mikaela Grace | Meysam Bastani

[1] Fadi Biadsy,et al. Google's cross-dialect Arabic voice search , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2] Yanpeng Li,et al. Improving deep neural networks based multi-accent Mandarin speech recognition using i-vectors and accent-specific top layer , 2015, INTERSPEECH.

[3] Kaisheng Yao,et al. KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4] Vassilios Digalakis,et al. On the integration of dialect and speaker adaptation in a multi-dialect speech recognition system , 1998, 9th European Signal Processing Conference (EUSIPCO 1998).

[5] Tara N. Sainath,et al. Multi-Dialect Speech Recognition with a Single Sequence-to-Sequence Model , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6] Martín Abadi,et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[7] Julius Kunze,et al. Transfer Learning for Speech Recognition on a Budget , 2017, Rep4NLP@ACL.

[8] Pedro J. Moreno,et al. Multi-Dialectical Languages Effect on Speech Recognition: Too Much Choice Can Hurt , 2015, ICNLSP.

[9] Khe Chai Sim,et al. Factorized Hidden Layer Adaptation for Deep Neural Network Based Acoustic Modeling , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[10] Y. Patel,et al. An integrated multi-dialect speech recognition system with optional speaker adaptation , 1995, EUROSPEECH.

[11] Dirk Van Compernolle,et al. Speaker clustering for dialectic robustness in speaker independent recognition , 1991, EUROSPEECH.

[12] Lyle Campbell,et al. Ethnologue: Languages of the world (review) , 2008 .

[13] Andrew W. Senior,et al. Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition , 2014, ArXiv.

[14] Steve Renals,et al. Learning Hidden Unit Contributions for Unsupervised Acoustic Model Adaptation , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[15] Jürgen Schmidhuber,et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[16] Jianwu Dang,et al. Exploring tonal information for Lhasa dialect acoustic modeling , 2016, 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP).

[17] Yifan Gong,et al. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[18] Albino Nogueiras,et al. Multidialectal Spanish acoustic modeling for speech recognition , 2009, Speech Commun..

[19] Joaquín González-Rodríguez,et al. Automatic language identification using deep neural networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20] Hui Jiang,et al. Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[21] Tara N. Sainath,et al. Lower Frame Rate Neural Network Acoustic Models , 2016, INTERSPEECH.

[22] Ke Wang,et al. Empirical Evaluation of Speaker Adaptation on DNN based Acoustic Model , 2018, INTERSPEECH.

[23] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[24] Ahmed Abdelali,et al. Spoken Arabic Algerian dialect identification , 2018, 2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP).

[25] Khe Chai Sim,et al. Learning Factorized Transforms for Unsupervised Adaptation of LSTM-RNN Acoustic Models , 2017, INTERSPEECH.

[26] Souvik Kundu,et al. Adaptation of Deep Neural Network Acoustic Models for Robust Automatic Speech Recognition , 2017, New Era for Robust Speech Recognition, Exploiting Deep Learning.

[27] Vassilios Diakoloukas,et al. Development of dialect-specific speech recognizers using adaptation methods , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[28] Tara N. Sainath,et al. Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home , 2017, INTERSPEECH.

[29] Andrew W. Senior,et al. Flat start training of CD-CTC-SMBR LSTM RNN acoustic models , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[30] P. Lewis. Ethnologue : languages of the world , 2009 .

[31] Kai Yu,et al. Cluster Adaptive Training for Deep Neural Network Based Acoustic Model , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[32] Khe Chai Sim,et al. Low-rank bases for factorized hidden layer adaptation of DNN acoustic models , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[33] Tanja Schultz,et al. Comparison of acoustic model adaptation techniques on non-native speech , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[34] Margo E. Wilson. Arabic Speakers: Language and Culture, Here and Abroad , 1996 .

[35] Yuqing Gao,et al. Speaker-independent upfront dialect adaptation in a large vocabulary continuous speech recognizer , 1998, ICSLP.

[36] Pedro J. Moreno,et al. Towards acoustic model unification across dialects , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[37] Yifan Gong,et al. Multi-accent deep neural network acoustic model with accent-specific top layer using the KLD-regularized model adaptation , 2014, INTERSPEECH.

[38] Khe Chai Sim,et al. learning Effective Factorized Hidden Layer Bases Using Student-Teacher Training for LSTM Acoustic Model Adaptation , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).