Unsupervised Adaptation of Acoustic Models for ASR Using Utterance-Level Embeddings from Squeeze and Excitation Networks

This paper proposes the adaptation of neural network-based acoustic models for automatic speech recognition (ASR) using a Squeeze-and-Excitation (SE) network. In particular, this work explores using the SE network to learn utterance-level embeddings. Acoustic modelling is performed with Light Gated Recurrent Units (LiGRU). The utterance embeddings are learned from hidden unit activations jointly with the LiGRU and are used to scale the activations of the corresponding hidden layers in the LiGRU network. The advantage of this approach is that it does not require domain labels, such as speaker or noise labels, to be known in order to perform the adaptation, thereby providing unsupervised adaptation. Global average pooling and attentive pooling are applied to the hidden units to extract utterance-level information that represents the speakers and acoustic conditions. ASR experiments were carried out on the TIMIT and Aurora 4 corpora. The proposed model outperforms the respective baselines on both datasets, with relative improvements of 5.59% on TIMIT and 5.54% on Aurora 4. These experiments show the potential of using the conditioning information learned via utterance embeddings in the SE network to adapt acoustic models to speakers, noise, and other acoustic conditions.
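To make the mechanism concrete, the PyTorch sketch below illustrates the general squeeze-and-excitation pattern the abstract describes: pool a recurrent layer's hidden activations over time into an utterance-level embedding (via global average or attentive pooling), pass it through a bottleneck "excitation" network with a sigmoid output, and use the result to rescale the hidden units. This is a minimal sketch under stated assumptions; the module name `UtteranceSEAdapter`, the bottleneck dimension, and the exact attention parameterisation are illustrative choices, not the authors' architecture.

```python
import torch
import torch.nn as nn


class UtteranceSEAdapter(nn.Module):
    """Sketch of SE-style utterance-level conditioning for a recurrent layer.

    Pools hidden activations over time into an utterance embedding and maps
    it to per-unit scaling factors. Names and sizes are hypothetical.
    """

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 128,
                 attentive: bool = True):
        super().__init__()
        self.attentive = attentive
        if attentive:
            # Attentive pooling: learn a scalar weight per frame.
            self.attention = nn.Sequential(
                nn.Linear(hidden_dim, bottleneck_dim),
                nn.Tanh(),
                nn.Linear(bottleneck_dim, 1),
            )
        # "Excitation": bottleneck MLP producing gates in (0, 1).
        self.excite = nn.Sequential(
            nn.Linear(hidden_dim, bottleneck_dim),
            nn.ReLU(),
            nn.Linear(bottleneck_dim, hidden_dim),
            nn.Sigmoid(),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, hidden_dim) activations of one recurrent layer.
        if self.attentive:
            w = torch.softmax(self.attention(h), dim=1)  # (B, T, 1)
            utt = (w * h).sum(dim=1)                     # weighted pooling
        else:
            utt = h.mean(dim=1)                          # global average pooling
        scale = self.excite(utt).unsqueeze(1)            # (B, 1, H)
        return h * scale                                 # rescale hidden units


if __name__ == "__main__":
    adapter = UtteranceSEAdapter(hidden_dim=512)
    h = torch.randn(8, 200, 512)   # 8 utterances, 200 frames each
    print(adapter(h).shape)        # torch.Size([8, 200, 512])
```

Because the gates are computed from the utterance itself rather than from speaker or noise labels, the rescaling adapts the acoustic model without any supervision, which is the property the paper emphasises.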
