Distance-aware DNNs for robust speech recognition

Distant speech recognition (DSR) remains an open challenge, even for state-of-the-art deep neural network (DNN) models. Previous work has attempted to improve DNNs under consistently distant speech. In real applications, however, the speaker-microphone distance (SMD) can be highly dynamic, varying even within a single utterance. This paper investigates how to mitigate the impact of dynamic SMD on DNN models. Our solution is to incorporate frame-level SMD information into DNN training. The SMD information is generated by a universal extractor learned on a meeting corpus, and we study different architectures for instantiating this extractor. On our target acoustic modeling task, we propose two approaches to building distance-aware DNN models from the SMD information: simple concatenation and distance adaptive training (DAT). Our experiments show that, in the simplest case, incorporating the SMD descriptors improves the word error rate of DNNs by 5.6% relative; further optimizing SMD extraction and integration yields additional gains.
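The simple-concatenation approach can be sketched as follows. This is a minimal illustration with assumed dimensions (40-dim acoustic features, 10-dim SMD descriptors); the paper does not specify the actual feature sizes, and the random arrays merely stand in for real features and extractor outputs.

```python
import numpy as np

# Hypothetical dimensions: T frames of 40-dim acoustic features,
# plus a 10-dim frame-level SMD descriptor from the universal extractor.
T, FEAT_DIM, SMD_DIM = 100, 40, 10

rng = np.random.default_rng(0)
acoustic = rng.standard_normal((T, FEAT_DIM))  # acoustic feature frames
smd = rng.standard_normal((T, SMD_DIM))        # frame-level SMD descriptors

# Simple concatenation: append the SMD descriptor to each acoustic frame,
# so the DNN input dimension grows from FEAT_DIM to FEAT_DIM + SMD_DIM.
dnn_input = np.concatenate([acoustic, smd], axis=1)

print(dnn_input.shape)  # (100, 50)
```

The DNN then trains on `dnn_input` exactly as it would on plain acoustic features; the distance adaptive training (DAT) variant instead uses the SMD descriptors to transform or modulate the network, analogous to speaker adaptive training.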
