Frequency Domain Multi-channel Acoustic Modeling for Distant Speech Recognition

Conventional far-field automatic speech recognition (ASR) systems typically employ microphone array techniques for speech enhancement in order to improve robustness against noise and reverberation. However, such speech enhancement techniques do not always improve ASR accuracy because the optimization criterion for speech enhancement is not directly related to the ASR objective. In this work, we develop new acoustic modeling techniques that optimize spatial filtering and long short-term memory (LSTM) layers on multi-channel (MC) input directly with an ASR criterion. In contrast to conventional methods, we incorporate array processing knowledge into the acoustic model, and we initialize the network with beamformer coefficients. We investigate the effects of such MC neural networks through ASR experiments on real-world far-field data in which users interact with an ASR system in uncontrolled acoustic environments. We show that our MC acoustic model reduces the word error rate (WER) by 16.5% on average relative to a single-channel ASR system using the traditional log-mel filter bank energy (LFBE) feature. Our results also show that our network with a spatial filtering layer on two-channel input achieves a relative WER reduction of 9.5% compared to conventional beamforming with seven microphones.
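The core idea of the spatial filtering layer can be illustrated with a minimal sketch: a trainable per-frequency-bin complex filter applied to a multi-channel STFT frame, with weights initialized from delay-and-sum beamformer coefficients as the abstract describes. This is an illustrative assumption of the layer's structure, not the paper's exact implementation; all function and class names (`delay_and_sum_weights`, `SpatialFilteringLayer`) are hypothetical.

```python
import numpy as np

def delay_and_sum_weights(num_ch, num_bins, fs, mic_delays):
    """Delay-and-sum beamformer coefficients w[k, c] = exp(-j*2*pi*f_k*tau_c) / C.

    mic_delays: per-microphone steering delays in seconds (hypothetical input).
    Returns a complex array of shape (num_bins, num_ch).
    """
    freqs = np.linspace(0.0, fs / 2.0, num_bins)  # bin center frequencies in Hz
    return np.exp(-2j * np.pi * np.outer(freqs, mic_delays)) / num_ch

class SpatialFilteringLayer:
    """Per-bin complex spatial filter over multi-channel STFT input.

    In the network described by the abstract these weights would be trained
    jointly with the LSTM layers under the ASR criterion; here we only show
    the forward pass and the beamformer-based initialization.
    """
    def __init__(self, init_weights):
        self.w = init_weights.copy()  # (num_bins, num_ch), complex, trainable

    def forward(self, stft_frame):
        # stft_frame: (num_bins, num_ch) complex STFT of one analysis frame.
        # Weighted sum across channels yields one enhanced channel per bin.
        return np.sum(self.w * stft_frame, axis=-1)  # shape: (num_bins,)
```

With zero steering delays the initialization reduces to channel averaging, so identical input channels pass through unchanged; training would then refine the weights away from this classical beamformer starting point.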
