Complementary tasks for context-dependent deep neural network acoustic models

We have previously found that context-dependent DNN acoustic models for automatic speech recognition can be improved by using monophone targets as a secondary task for the network. This paper asks whether the improvements derive from the regularising effect of having a much smaller number of monophone outputs than the typical number of tied states, or from the use of targets that are not tied to an arbitrary state clustering. We investigate the use of factorised targets for left and right context, and targets motivated by the articulatory properties of the phonemes. We present results on a large-vocabulary lecture recognition task. Although the regularising effect of monophones seems to be important, all schemes give substantial improvements over the baseline single-task system, even though the cardinality of the outputs is relatively high.
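To make the multi-task setup concrete, below is a minimal PyTorch sketch of a hybrid DNN acoustic model with a shared trunk and two softmax heads: the primary tied-state targets and a secondary context-independent (monophone) task. The layer sizes, the numbers of tied states and monophones, and the secondary-task weight `alpha` are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class MultiTaskAcousticModel(nn.Module):
    """Shared hidden layers feeding two output heads: primary
    tied-state targets and a secondary monophone task."""
    def __init__(self, n_feats=440, n_hidden=2048, n_layers=6,
                 n_tied_states=6000, n_monophones=48):
        super().__init__()
        layers, dim = [], n_feats
        for _ in range(n_layers):
            layers += [nn.Linear(dim, n_hidden), nn.Sigmoid()]
            dim = n_hidden
        self.shared = nn.Sequential(*layers)
        self.tied_head = nn.Linear(n_hidden, n_tied_states)  # primary task
        self.mono_head = nn.Linear(n_hidden, n_monophones)   # secondary task

    def forward(self, x):
        h = self.shared(x)
        return self.tied_head(h), self.mono_head(h)

model = MultiTaskAcousticModel()
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 440)                # a minibatch of acoustic feature frames
tied_y = torch.randint(0, 6000, (32,))  # tied-state alignment labels
mono_y = torch.randint(0, 48, (32,))    # monophone alignment labels

tied_logits, mono_logits = model(x)
alpha = 0.3                             # secondary-task weight (assumed value)
loss = criterion(tied_logits, tied_y) + alpha * criterion(mono_logits, mono_y)
loss.backward()  # both tasks propagate gradients into the shared layers
```

At decoding time only the tied-state head would be used; the secondary head exists purely to shape the shared representation during training.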
