Hybrid Neural Networks for On-device Directional Hearing

On-device directional hearing requires separating an audio source from a given direction while meeting stringent, human-imperceptible latency requirements. While neural nets can achieve significantly better performance than traditional beamformers, all existing models fall short of supporting low-latency causal inference on computationally constrained wearables. We present HybridBeam, a hybrid model that combines traditional beamformers with a custom lightweight neural net. The beamformer reduces the computational burden of the neural net and improves its generalizability, while the neural net is designed to further reduce memory and computational overhead to enable real-time, low-latency operation. Our evaluation shows performance comparable to state-of-the-art causal inference models on synthetic data, with a 5x reduction in model size, a 4x reduction in computation per second, a 5x reduction in processing time, and better generalization to real hardware data. Further, our real-time hybrid model runs in 8 ms on mobile CPUs designed for low-power wearable devices and achieves an end-to-end latency of 17.5 ms.
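
The abstract describes the architecture only at a high level; the sketch below illustrates the general shape of such a hybrid pipeline: a traditional beamformer steers the microphone array toward the target direction, and a small causal neural net cleans up the residual interference. This is a minimal sketch under assumptions of ours, not the paper's implementation: the delay-and-sum front-end, the TinyCausalNet architecture, and every name and hyperparameter (delay_and_sum, channel counts, layer sizes) are illustrative stand-ins.

import numpy as np
import torch
import torch.nn as nn

SPEED_OF_SOUND = 343.0  # m/s

def delay_and_sum(mics, positions, direction, sr):
    """Steer (n_ch, n_samp) signals toward a unit `direction` vector.

    Fractional delays are applied as linear phase shifts in the
    frequency domain (a circular shift -- acceptable for a sketch).
    """
    n_ch, n_samp = mics.shape
    delays = positions @ direction / SPEED_OF_SOUND * sr  # in samples
    delays -= delays.min()                                # keep nonnegative
    spec = np.fft.rfft(mics, axis=1)
    freqs = np.fft.rfftfreq(n_samp)                       # cycles/sample
    phase = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft(spec * phase, n=n_samp, axis=1).mean(axis=0)

class TinyCausalNet(nn.Module):
    """Small stack of causal dilated 1-D convolutions (illustrative only)."""
    def __init__(self, channels=32, layers=4):
        super().__init__()
        blocks, in_ch = [], 1
        for i in range(layers):
            d = 2 ** i
            blocks += [nn.ConstantPad1d((2 * d, 0), 0.0),  # left pad => causal
                       nn.Conv1d(in_ch, channels, kernel_size=3, dilation=d),
                       nn.PReLU()]
            in_ch = channels
        self.body = nn.Sequential(*blocks)
        self.head = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):
        return self.head(self.body(x))

if __name__ == "__main__":
    sr = 16000
    rng = np.random.default_rng(0)
    mics = rng.standard_normal((4, sr))           # 4 channels, 1 s of noise
    pos = rng.uniform(-0.05, 0.05, size=(4, 3))   # mics within a 10 cm cube
    look = np.array([1.0, 0.0, 0.0])              # target look direction
    beamed = delay_and_sum(mics, pos, look, sr)   # cheap spatial filtering
    net = TinyCausalNet()
    with torch.no_grad():
        out = net(torch.from_numpy(beamed).float()[None, None, :])
    print(out.shape)  # torch.Size([1, 1, 16000])

The division of labor mirrors the abstract's claim: the cheap, parameter-free beamformer does the spatial heavy lifting, so the learned component can stay small enough for wearable-class CPUs.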
