Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection

In many speech-enabled human-machine interaction scenarios, user speech can overlap with the device's playback audio. In these instances, performance on tasks such as keyword spotting (KWS) and device-directed speech detection (DDD) can degrade significantly. To address this problem, we propose an implicit acoustic echo cancellation (iAEC) framework in which a neural network is trained to exploit the additional information from a reference microphone channel, learning to ignore the interfering signal and improve detection performance. We study this framework for KWS on an augmented version of Google Speech Commands v2 and for DDD on a real-world Alexa device dataset. Notably, we show a 56% reduction in false-reject rate for the DDD task under device playback conditions. For the KWS task, we also show comparable or superior performance relative to a strong end-to-end neural echo cancellation plus KWS baseline, at an order of magnitude lower computational cost.
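As a rough illustration of the input arrangement described above, the sketch below stacks features from the microphone signal and the reference signal into a two-channel array that a downstream detector could consume. It is a minimal sketch under stated assumptions, not the authors' implementation: the frame log-energy feature, the function names `frame_log_energy` and `build_iaec_input`, the frame sizes, and the synthetic signals are all hypothetical stand-ins, since the abstract does not specify features or architecture.

```python
import numpy as np


def frame_log_energy(wave, frame_len=400, hop=160):
    """Hypothetical stand-in feature: log energy of each frame.

    frame_len and hop are assumed values (25 ms / 10 ms at 16 kHz).
    """
    n_frames = 1 + (len(wave) - frame_len) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    # Small constant avoids log(0) on silent frames.
    return np.log(np.sum(frames ** 2, axis=1) + 1e-8)


def build_iaec_input(mic_wave, ref_wave):
    """Stack microphone and reference features along a channel axis.

    The detector receives both channels and is left to learn, during
    training, which parts of the microphone signal to ignore. No echo
    is subtracted explicitly here.
    """
    assert len(mic_wave) == len(ref_wave), "signals must be time-aligned"
    mic_feats = frame_log_energy(mic_wave)
    ref_feats = frame_log_energy(ref_wave)
    return np.stack([mic_feats, ref_feats], axis=0)  # shape: (2, n_frames)


# Synthetic one-second example at an assumed 16 kHz sample rate.
rng = np.random.default_rng(0)
playback = rng.standard_normal(16000)   # stand-in for device playback
speech = rng.standard_normal(16000)     # stand-in for user speech
mic = speech + 0.5 * playback           # microphone hears both
x = build_iaec_input(mic, playback)
print(x.shape)
```

The design point this is meant to convey is that the reference signal enters as an extra input channel rather than being removed by a separate echo canceller in front of the detector.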
