论文信息 - BUT 2014 Babel system: analysis of adaptation in NN based systems

BUT 2014 Babel system: analysis of adaptation in NN based systems

Features based on a hierarchy of neural networks with compressive layers – Stacked Bottle-Neck (SBN) features – were recently shown to provide excellent performance in LVCSR systems. This paper summarizes several techniques investigated in our work towards Babel 2014 evaluations: (1) using several versions of fundamental frequency (F0) estimates, (2) semi-supervised training on un-transcribed data and mainly (3) adapting the NN structure at different levels. They are tested on three 2014 Babel languages with full GMMand DNN-based systems. Separately and in combination, they are shown to outperform the baselines and confirm the usefulness of bottle-neck features in current ASR systems.

[1] Florian Metze,et al. Models of tone for tonal and non-tonal languages , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[2] Wonkyum Lee,et al. Modular combination of deep neural networks for acoustic modeling , 2013, INTERSPEECH.

[3] Tara N. Sainath,et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[4] Jan Cernocký,et al. BUT BABEL system for spontaneous Cantonese , 2013, INTERSPEECH.

[5] Mattias Heldner,et al. The fundamental frequency variation spectrum , 2008 .

[6] Geoffrey E. Hinton,et al. Understanding how Deep Belief Networks perform acoustic modelling , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7] Hideki Kawahara,et al. YIN, a fundamental frequency estimator for speech and music. , 2002, The Journal of the Acoustical Society of America.

[8] Jonathan G. Fiscus,et al. Results of the 2006 Spoken Term Detection Evaluation , 2006 .

[9] Lukás Burget,et al. Comparison of keyword spotting approaches for informal continuous speech , 2005, INTERSPEECH.

[10] Yee Whye Teh,et al. A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[11] Jonathan Le Roux,et al. Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[12] Hermann Ney,et al. Confidence measures for large vocabulary continuous speech recognition , 2001, IEEE Trans. Speech Audio Process..

[13] David Talkin,et al. A Robust Algorithm for Pitch Tracking ( RAPT ) , 2005 .

[14] Sanjeev Khudanpur,et al. A pitch extraction algorithm tuned for automatic speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] Martin Karafiát,et al. Convolutive Bottleneck Network features for LVCSR , 2011, 2011 IEEE Workshop on Automatic Speech Recognition & Understanding.

[16] Richard M. Schwartz,et al. Recent progress on the discriminative region-dependent transform for speech feature extraction , 2006, INTERSPEECH.

[17] Lukás Burget,et al. Investigation into bottle-neck features for meeting speech recognition , 2009, INTERSPEECH.

[18] Jan Cernocký,et al. But neural network features for spontaneous Vietnamese in BABEL , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19] Lukás Burget,et al. Sequence-discriminative training of deep neural networks , 2013, INTERSPEECH.

[20] Peter Kulchyski. and , 2015 .

[21] Andreas G. Andreou,et al. Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition , 1998, Speech Commun..

[22] Richard M. Schwartz,et al. Score normalization and system combination for improved keyword spotting , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.