Y$^2$-Net FCRN for Acoustic Echo and Noise Suppression

In recent years, deep neural networks (DNNs) have been studied as an alternative to traditional acoustic echo cancellation (AEC) algorithms. The proposed models achieved remarkable performance for the separate tasks of AEC and residual echo suppression (RES). A promising network topology is the fully convolutional recurrent network (FCRN) structure, which has already proven effective on noise suppression and AEC tasks individually. However, combining AEC, postfiltering, and noise suppression into a single network typically leads to a noticeable decline in the quality of the near-end speech component due to the lack of a separate loss for echo estimation. In this paper, we propose a two-stage model (Y$^2$-Net) which consists of two FCRNs, each with two inputs and one output (Y-Net). The first stage (AEC) yields an echo estimate, which, as a novelty for a DNN AEC model, is further used by the second stage to perform RES and noise suppression. While the subjective listening test of the Interspeech 2021 AEC Challenge mostly yielded results close to the baseline, the proposed method scored an average improvement of 0.46 points over the baseline on the blind test set in double-talk on the instrumental metric DECMOS, provided by the challenge organizers.
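The two-stage data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: each FCRN is replaced by a hypothetical placeholder (`y_net_stage`, a single dense layer with a tanh), and the choice of inputs for each stage (microphone and far-end reference for stage 1, microphone and echo estimate for stage 2) is an assumption made for illustration, as are all names, shapes, and weights.

```python
import numpy as np


def y_net_stage(input_a, input_b, weights):
    """Hypothetical stand-in for one two-input, one-output FCRN (a "Y-Net").

    Both inputs have shape (frames, bins); the output has the same shape.
    A real FCRN would use convolutional and recurrent layers instead.
    """
    stacked = np.concatenate([input_a, input_b], axis=-1)  # (frames, 2 * bins)
    return np.tanh(stacked @ weights)                      # (frames, bins)


def y2_net(mic_spec, far_end_spec, w_aec, w_res):
    """Chain two Y-Net stages: AEC first, then RES and noise suppression."""
    # Stage 1 (AEC): produce an explicit echo estimate, which could be
    # trained with its own loss.
    echo_est = y_net_stage(mic_spec, far_end_spec, w_aec)
    # Stage 2 (RES + noise suppression): reuse the echo estimate as an input
    # (assumed here to be paired with the microphone signal).
    enhanced = y_net_stage(mic_spec, echo_est, w_res)
    return echo_est, enhanced


# Toy example with random "spectral" features and random weights.
rng = np.random.default_rng(0)
frames, bins = 10, 8
mic = rng.standard_normal((frames, bins))
far = rng.standard_normal((frames, bins))
w_aec = 0.1 * rng.standard_normal((2 * bins, bins))
w_res = 0.1 * rng.standard_normal((2 * bins, bins))

echo_est, enhanced = y2_net(mic, far, w_aec, w_res)
print(echo_est.shape, enhanced.shape)
```

Returning both outputs reflects the motivation stated in the abstract: an explicit echo estimate allows a separate loss on the first stage, while the second stage output is the enhanced near-end speech.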
