But neural network features for spontaneous Vietnamese in BABEL

This paper presents our work on speech recognition of Vietnamese spontaneous telephone conversations. It focuses on feature extraction by Stacked Bottle-Neck neural networks: several improvements such as semi-supervised training on untranscribed data, increasing of precision of state targets, and CMLLR adaptations were investigated. We have also tested speaker adaptive training of this architecture and significant gain was found. The results are reported on BABEL Vietnamese data.

[1]  Martin Karafiát,et al.  The language-independent bottleneck features , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[2]  Lukás Burget,et al.  Sequence-discriminative training of deep neural networks , 2013, INTERSPEECH.

[3]  Dong Yu,et al.  Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[4]  Jonathan Le Roux,et al.  Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[5]  Hermann Ney,et al.  Confidence measures for large vocabulary continuous speech recognition , 2001, IEEE Trans. Speech Audio Process..

[6]  Martin Karafiát,et al.  Convolutive Bottleneck Network features for LVCSR , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[7]  Mark J. F. Gales,et al.  Tandem system adaptation using multiple linear feature transforms , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[8]  Jan Cernocký,et al.  BUT BABEL system for spontaneous Cantonese , 2013, INTERSPEECH.

[9]  Geoffrey E. Hinton,et al.  Understanding how Deep Belief Networks perform acoustic modelling , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Pietro Laface,et al.  On the use of a multilingual neural network front-end , 2008, INTERSPEECH.

[11]  Andreas G. Andreou,et al.  Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition , 1998, Speech Commun..

[12]  Richard M. Schwartz,et al.  Score normalization and system combination for improved keyword spotting , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[13]  Lukás Burget,et al.  Semi-supervised training of Deep Neural Networks , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[14]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[15]  Richard M. Schwartz,et al.  Recent progress on the discriminative region-dependent transform for speech feature extraction , 2006, INTERSPEECH.

[16]  Lukás Burget,et al.  Investigation into bottle-neck features for meeting speech recognition , 2009, INTERSPEECH.

[17]  Hui Jiang,et al.  Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[18]  Martin Karafiát,et al.  Adaptation of multilingual stacked bottle-neck neural network structure for new language , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Kenneth Ward Church,et al.  Deep neural network features and semi-supervised training for low resource speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[20]  David Talkin,et al.  A Robust Algorithm for Pitch Tracking ( RAPT ) , 2005 .

[21]  Hynek Hermansky,et al.  Multilingual MLP features for low-resource LVCSR systems , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).