Cambridge University transcription systems for the Multi-Genre Broadcast challenge

We describe the development of our speech-to-text transcription systems for the 2015 Multi-Genre Broadcast (MGB) challenge. Key features of the systems are: a segmentation system based on deep neural networks (DNNs); the use of HTK 3.5 to build DNN-based hybrid and tandem acoustic models, and the use of these models in a joint decoding framework; techniques for adapting DNN-based acoustic models, including parameterised activation function adaptation; alternative acoustic models built using Kaldi; and recurrent neural network language models (RNNLMs) with RNNLM adaptation. The same language models were used with both the HTK and Kaldi acoustic models, and various combined systems were built. The final systems had the lowest error rates on the evaluation data.
