Deep Activation Mixture Model for Speech Recognition

Deep learning approaches achieve state-of-the-art performance in a range of applications, including speech recognition. However, the parameters of a deep neural network (DNN) are hard to interpret, which makes regularisation and adaptation to speaker or acoustic conditions challenging. This paper proposes the deep activation mixture model (DAMM) to address these problems. The output of one hidden layer is modelled as the sum of a mixture model and a residual model: the mixture model forms a smooth activation-function contour, while the residual model captures fluctuations around that contour. The mixture model offers two advantages: first, it introduces a novel form of regularisation on the DNN; second, it enables novel adaptation schemes. The proposed approach is evaluated on a large-vocabulary U.S. English broadcast news task. It yields slightly better performance than the DNN baselines, and under utterance-level unsupervised adaptation the adapted DAMM achieves further gains.
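To make the contour-plus-residual idea concrete, below is a minimal NumPy sketch of one DAMM-style hidden layer. It assumes a specific reading of the abstract: each hidden unit is assigned a fixed 2-D grid position, the contour is a Gaussian mixture evaluated at those positions, and the residual is a conventional affine term. All names (`damm_hidden_layer`, `centres`, `mix_logits`, `unit_pos`), the 2-D grid, and the sigmoid output are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def damm_hidden_layer(x, W, b, centres, log_widths, mix_logits, unit_pos):
    """Hypothetical sketch of a DAMM-style hidden layer.

    The pre-activation is modelled as a smooth "contour" (a Gaussian
    mixture evaluated at fixed per-unit grid positions) plus an affine
    "residual" term; a sigmoid then gives the hidden activation.
    Parameter names are illustrative, not taken from the paper.
    """
    # Residual term: standard affine transform of the layer input.
    residual = W @ x + b                                   # (n_units,)

    # Contour term: mixture of Gaussian bumps over the 2-D grid
    # positions assigned to the hidden units.
    weights = np.exp(mix_logits - np.logaddexp.reduce(mix_logits))  # softmax
    d2 = ((unit_pos[None, :, :] - centres[:, None, :]) ** 2).sum(-1)  # (K, n_units)
    contour = (weights[:, None]
               * np.exp(-d2 / np.exp(log_widths)[:, None])).sum(0)   # (n_units,)

    # Hidden activation: sigmoid of contour plus residual.
    return 1.0 / (1.0 + np.exp(-(contour + residual)))

# Toy usage: 4 hidden units on a 2x2 grid, 2 mixture components, 3 inputs.
rng = np.random.default_rng(0)
unit_pos = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
h = damm_hidden_layer(rng.normal(size=3), rng.normal(size=(4, 3)),
                      np.zeros(4), rng.normal(size=(2, 2)),
                      np.zeros(2), np.zeros(2), unit_pos)
```

Under this reading, regularisation comes from the low-dimensional mixture parameters constraining the shape of the activations, and adaptation could update only the mixture parameters (e.g. per utterance) while leaving the residual weights fixed.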
