Learning Hidden Unit Contributions for Unsupervised Acoustic Model Adaptation

This work presents a broad study of the adaptation of neural network acoustic models by means of learning hidden unit contributions (LHUC), a method that linearly re-combines hidden units in a speaker- or environment-dependent manner using small amounts of unsupervised adaptation data. We also extend LHUC to a speaker adaptive training (SAT) framework that leads to a more adaptable DNN acoustic model, working in both a speaker-dependent and a speaker-independent manner, without the need to maintain auxiliary speaker-dependent feature extractors or to introduce significant speaker-dependent changes to the DNN structure. Through a series of experiments on four different speech recognition benchmarks (TED talks, Switchboard, AMI meetings, and Aurora4) comprising 270 test speakers, we show that LHUC in both its test-only and SAT variants results in consistent word error rate reductions ranging from 5% to 23% relative, depending on the task and the degree of mismatch between training and test data. In addition, we investigate the effect of the amount of adaptation data per speaker, the quality of unsupervised adaptation targets, the complementarity to other adaptation techniques, one-shot adaptation, and an extension to adapting DNNs trained in a sequence discriminative manner.
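The re-combination of hidden units described above can be illustrated with a short sketch. This is not the authors' implementation: it assumes a single sigmoid hidden layer and an amplitude function of the form 2·sigmoid(r), a re-parameterisation commonly associated with LHUC that keeps each per-unit scale in the range (0, 2). The names `lhuc_amplitude`, `hidden_layer`, and `r_speaker` are hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def lhuc_amplitude(r):
    """Map unconstrained speaker parameters r to per-unit scales in (0, 2).

    Assumption: amplitude = 2 * sigmoid(r), so r = 0 gives a scale of 1
    and leaves the speaker-independent activations unchanged.
    """
    return 2.0 * sigmoid(r)


def hidden_layer(x, W, b, r=None):
    """Sigmoid hidden layer, optionally re-scaled per unit for one speaker.

    W and b are the shared (speaker-independent) weights; r is the only
    speaker-dependent quantity, with one value per hidden unit.
    """
    h = sigmoid(x @ W + b)
    if r is None:
        return h
    return lhuc_amplitude(r) * h


# Toy example: 4 frames, 5 input features, 3 hidden units.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))
W = rng.normal(size=(5, 3))
b = np.zeros(3)

# Before adaptation the speaker parameters are zero (scale of 1).
r_speaker = np.zeros(3)
h_si = hidden_layer(x, W, b)             # speaker-independent output
h_sd = hidden_layer(x, W, b, r_speaker)  # identical until r is updated
```

In an adaptation setting of this kind, only `r_speaker` would be updated on a speaker's adaptation data while `W` and `b` stay fixed, which is why the number of speaker-dependent parameters stays small (one per hidden unit).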
