Paper information - Novel neural network based fusion for multistream ASR

Novel neural network based fusion for multistream ASR

The robustness of automatic speech recognition (ASR) to acoustic mismatches can be improved by a multistream framework. A frequently used approach to combining decisions from individual streams involves training a large number of neural networks, one for each possible stream combination. In this work, we propose to simplify the fusion by replacing the large number of fusion networks with a single fusion network. During training of the proposed fusion network, features from a stream are randomly dropped out. At test time, corrupted streams are identified and dropped out to improve robustness. Using the proposed approach, we were able to achieve a significant reduction in the number of parameters while incurring less than 2.5% relative degradation compared with the conventional fusion technique. Furthermore, the proposed fusion network is applied in a multistream ASR system to improve the noise robustness of the Aurora4 speech recognition task. Noticeable improvements were observed over baseline systems (relative improvements of 9.2% in microphone mismatch and 3.2% in additive noise conditions).
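The stream dropout idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that "dropping" a stream means zeroing its features before concatenation, that drops are sampled independently per example with a keep probability of 0.5, and that at least one stream is always kept. The function names `sample_stream_mask` and `fuse_streams` are hypothetical, and the step that identifies corrupted streams at test time is replaced here by a hand-written mask.

```python
import numpy as np

def sample_stream_mask(num_examples, num_streams, keep_prob, rng):
    """Draw a boolean keep/drop mask per example (True = keep the stream)."""
    mask = rng.random((num_examples, num_streams)) < keep_prob
    # Assumption: an example with every stream dropped is uninformative,
    # so one randomly chosen stream is switched back on for such rows.
    rows = np.flatnonzero(~mask.any(axis=1))
    forced = rng.integers(0, num_streams, size=num_examples)
    mask[rows, forced[rows]] = True
    return mask

def fuse_streams(streams, mask):
    """Zero out dropped streams and concatenate all streams along the feature axis."""
    masked = [s * mask[:, [i]] for i, s in enumerate(streams)]
    return np.concatenate(masked, axis=1)

rng = np.random.default_rng(0)
# Three toy streams with 3, 2, and 5 features for a batch of 4 examples.
streams = [np.ones((4, 3)), np.ones((4, 2)), np.ones((4, 5))]

# Training: streams are dropped at random for each example.
train_mask = sample_stream_mask(4, len(streams), keep_prob=0.5, rng=rng)
train_input = fuse_streams(streams, train_mask)

# Test: assume some external detector flagged stream 1 as corrupted.
test_mask = np.ones((4, len(streams)), dtype=bool)
test_mask[:, 1] = False
test_input = fuse_streams(streams, test_mask)
```

Because the single fusion network always receives an input of the same width, the same weights can serve every stream combination, which is where the parameter saving over one network per combination would come from.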

Hynek Hermansky | Sri Harish Reddy Mallidi
