FusionRNN: Shared Neural Parameters for Multi-Channel Distant Speech Recognition

Distant speech recognition remains a challenging application for modern deep-learning-based Automatic Speech Recognition (ASR) systems because of complex recording conditions involving noise and reverberation. Multiple microphones are commonly combined with well-known speech processing techniques to enhance the original signals and thus improve the speech recognizer's performance. These multi-channel signals follow similar input distributions with respect to the global speech information but also carry a significant amount of noise. Consequently, the robustness of the input representation is key to obtaining reasonable recognition rates. In this work, we propose a Fusion Layer (FL) based on shared neural parameters. We use it to produce an expressive embedding of multiple microphone signals that can easily be combined with any existing ASR pipeline. The proposed model, called FusionRNN, showed promising results on a multi-channel distant speech recognition task and consistently outperformed baseline models while maintaining an equal training time.
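For illustration only, the sketch below shows one way a fusion layer with shared neural parameters could be realized in PyTorch: a single projection is reused across every microphone channel, and the per-channel embeddings are combined into one representation that any downstream acoustic model can consume. The linear projection, ReLU activation, summation rule, and all dimensions are assumptions made for this example, not the paper's exact configuration.

import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Minimal sketch of a fusion layer with weights shared across channels.

    One linear projection (shared parameters) maps each microphone channel's
    features into a common embedding space; the per-channel embeddings are then
    summed into a single fused representation. The summation rule and the plain
    linear projection are assumptions, not the paper's exact recipe.
    """

    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        # One set of weights reused for every microphone channel.
        self.shared_proj = nn.Linear(feat_dim, embed_dim)
        self.activation = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_channels, time, feat_dim)
        projected = self.shared_proj(x)   # (batch, n_channels, time, embed_dim)
        fused = projected.sum(dim=1)      # combine channels -> (batch, time, embed_dim)
        return self.activation(fused)

# Hypothetical usage: fuse 4-channel filterbank features, then feed any sequence model.
if __name__ == "__main__":
    fusion = FusionLayer(feat_dim=40, embed_dim=256)
    signals = torch.randn(8, 4, 100, 40)   # (batch, channels, frames, features)
    embedding = fusion(signals)            # (8, 100, 256)
    rnn = nn.GRU(input_size=256, hidden_size=512, batch_first=True)
    outputs, _ = rnn(embedding)            # downstream recurrent acoustic model

Because the projection parameters are shared, the number of fusion weights is independent of the number of microphones, which is consistent with the shared-parameter design described in the abstract.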
