Spatial speech detection for binaural hearing aids using deep phoneme classifiers

Current hearing aids are limited in their ability to optimize speech enhancement specifically for speech arriving from spatially distinct sound sources. In this study, we therefore propose an approach for the spatial detection of speech that is based on sound source localization and blind optimization of speech enhancement for binaural hearing aids. We combine an estimator for the direction of arrival (DOA), which features high spatial resolution but is not specialized to speech, with a measure of speech quality that has low spatial resolution and is obtained after directional filtering. The DOA estimator provides a spatial sound source probability in the frontal horizontal plane. The measure of speech quality is based on phoneme representations obtained from a deep neural network that is part of a hybrid automatic speech recognition (ASR) system. Three ASR-based speech quality measures (ASQMs) are explored: entropy, mean temporal distance (M-Measure), and matched phoneme (MaP) filtering. We tested the approach in four acoustic scenes with one speaker and either a localized or a diffuse noise source at various signal-to-noise ratios (SNRs) in anechoic or reverberant conditions, and analyzed the effects of incorrect spatial filtering and noise. We show that two of the three ASQMs (M-Measure and MaP filtering) are suited to reliably identifying the speech target in different conditions. The system is not adapted to the environment and requires neither a priori information about the acoustic scene nor a reference signal to estimate the quality of the enhanced speech signal. Nevertheless, our approach performs well in all acoustic scenes and at all SNRs tested, and it reliably detects incorrect spatial filtering angles.
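To make two of the named measures concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a phoneme posteriorgram is given as an array of shape (frames, phoneme classes) whose rows sum to one. It takes the entropy measure to be the mean per-frame Shannon entropy and the M-Measure to be the mean symmetric Kullback-Leibler divergence between posterior vectors separated by a set of frame delays. The function names, the delay values, the synthetic posteriorgrams, and the rule of picking the steering angle with the largest M-Measure are all illustrative assumptions.

```python
import numpy as np

def mean_entropy(post, eps=1e-12):
    """Mean per-frame Shannon entropy (in nats) of a (T, K) posteriorgram."""
    p = np.clip(post, eps, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=1)))

def m_measure(post, delays=(1, 2, 5, 10), eps=1e-12):
    """Mean symmetric KL divergence between frames separated by each delay.

    The delay values are an assumption chosen for illustration.
    """
    p = np.clip(post, eps, 1.0)
    logp = np.log(p)
    per_delay = []
    for d in delays:
        a, b = p[:-d], p[d:]
        # KL(a||b) + KL(b||a) = sum_k (a_k - b_k) * (log a_k - log b_k)
        skl = np.sum((a - b) * (logp[:-d] - logp[d:]), axis=1)
        per_delay.append(np.mean(skl))
    return float(np.mean(per_delay))

def pick_angle(posteriorgrams_by_angle):
    """Hypothetical selection rule: steering angle with the largest M-Measure."""
    return max(posteriorgrams_by_angle,
               key=lambda angle: m_measure(posteriorgrams_by_angle[angle]))

def synthetic_posteriorgram(T, K, sharpness, rng, seg=10):
    """Toy posteriorgram: piecewise-constant phoneme labels plus logit noise."""
    labels = np.repeat(rng.integers(0, K, size=T // seg + 1), seg)[:T]
    logits = rng.normal(size=(T, K))
    logits[np.arange(T), labels] += sharpness
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
sharp = synthetic_posteriorgram(200, 40, 8.0, rng)  # stands in for well-filtered speech
flat = synthetic_posteriorgram(200, 40, 0.5, rng)   # stands in for a wrong steering angle
best = pick_angle({-30: flat, 0: sharp, 30: flat})
```

In this toy setting the confident, time-varying posteriorgram has lower entropy and a larger M-Measure than the near-uniform one, which is the behavior a blind quality measure relies on when comparing candidate steering angles.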
