Paper Information - Quaternion Neural Networks for Multi-channel Distant Speech Recognition

Quaternion Neural Networks for Multi-channel Distant Speech Recognition

Despite significant progress in automatic speech recognition (ASR), distant ASR remains challenging due to noise and reverberation. A common way to mitigate this issue is to equip recording devices with multiple microphones that capture the acoustic scene from different perspectives. These multi-channel audio recordings contain specific internal relations between the signals. In this paper, we propose to capture these inter- and intra-structural dependencies with quaternion neural networks, which can jointly process multiple signals as whole quaternion entities. The quaternion algebra replaces the standard dot product with the Hamilton product, thus offering a simple and elegant way to model dependencies between elements. The quaternion layers are then coupled with a recurrent neural network, which can learn long-term dependencies in the time domain. We show that a quaternion long short-term memory neural network (QLSTM), trained on the concatenated multi-channel speech signals, outperforms an equivalent real-valued LSTM on two different tasks of multi-channel distant speech recognition.
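To make the role of the Hamilton product more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that features from four microphones are packed into the four components of one quaternion per feature index; the helper names `pack_channels`, `hamilton_product`, and `quaternion_linear` are hypothetical and chosen only for illustration. The layer shown is a bias-free, activation-free quaternion analogue of a dense layer, not a full QLSTM.

```python
from typing import List, Sequence, Tuple

# A quaternion is stored as (r, i, j, k). This layout is an assumption of the sketch.
Quaternion = Tuple[float, float, float, float]


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Return the Hamilton product p * q (non-commutative)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


def pack_channels(
    m1: Sequence[float],
    m2: Sequence[float],
    m3: Sequence[float],
    m4: Sequence[float],
) -> List[Quaternion]:
    """Pack four per-microphone feature vectors into one quaternion per feature.

    Hypothetical helper: the mapping of microphones to components is assumed.
    """
    return list(zip(m1, m2, m3, m4))


def quaternion_linear(
    weights: Sequence[Sequence[Quaternion]],
    inputs: Sequence[Quaternion],
) -> List[Quaternion]:
    """Compute output[n] = sum over m of weights[n][m] * inputs[m]."""
    outputs: List[Quaternion] = []
    for row in weights:
        acc = (0.0, 0.0, 0.0, 0.0)
        for w, x in zip(row, inputs):
            prod = hamilton_product(w, x)
            acc = tuple(s + t for s, t in zip(acc, prod))
        outputs.append(acc)
    return outputs


# Two feature values per microphone become two quaternion inputs.
x = pack_channels([0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8])
# One output unit with two quaternion weights (arbitrary illustrative values).
w = [[(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)]]
y = quaternion_linear(w, x)
```

Each quaternion weight mixes all four components of its input through the Hamilton product, so a single weight carries four real parameters where a real-valued layer acting on the same four values would need a 4x4 block of sixteen. This sharing is one way to read the abstract's claim that the quaternion algebra models dependencies between the signals.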
