Convolutional Neural Network-based Speech Enhancement for Cochlear Implant Recipients

Attempts to develop speech enhancement algorithms that improve speech intelligibility for cochlear implant (CI) users have met with limited success. To improve speech enhancement for CI users, we propose performing enhancement in a cochlear filter-bank feature space, a feature set designed specifically for CI users and based on CI auditory stimuli. We use a convolutional neural network (CNN) to extract both stationary and non-stationary components of environmental acoustics and speech. We propose three CNN architectures: (1) a vanilla CNN that directly generates the enhanced signal; (2) a spectral-subtraction-style CNN (SS-CNN) that first predicts the noise and then generates the enhanced signal by subtracting the predicted noise from the noisy signal; and (3) a Wiener-style CNN (Wiener-CNN) that generates an optimal mask for suppressing noise. A key limitation of the proposed networks is that they introduce considerable delay, which limits their real-time use for CI users. To address this, we also consider causal variants of these networks. Our experiments show that the proposed networks, in both causal and non-causal forms, achieve significant improvement over existing baseline systems. We also find that the causal Wiener-CNN outperforms the other networks and yields the best overall envelope coefficient measure (ECM). The proposed algorithms represent a viable option for implementation on the CCi-MOBILE research platform as a pre-processor for CI users in naturalistic environments.
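
The difference between the three output styles, and between causal and non-causal processing, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `(channels, frames)` feature layout, the flooring at zero in the subtraction step, and the sigmoid used to bound the mask are all assumptions made for illustration, and a fixed 1-D kernel stands in for the learned CNN layers.

```python
import numpy as np

def conv1d_time(x, kernel, causal):
    """Filter each channel of x (channels, frames) along the time axis.

    causal=True pads only on the left, so output frame t depends on
    frames <= t. causal=False pads on both sides, so frame t also
    depends on future frames (which adds algorithmic delay in practice).
    """
    k = len(kernel)
    left = k - 1 if causal else (k - 1) // 2
    right = 0 if causal else k - 1 - left
    padded = np.pad(x, ((0, 0), (left, right)))
    out = np.zeros_like(x, dtype=float)
    for i, w in enumerate(kernel):
        out += w * padded[:, i:i + x.shape[1]]
    return out

def enhance_vanilla(net_out):
    """Vanilla style: the network output is taken as the enhanced features."""
    return net_out

def enhance_ss(noisy, predicted_noise):
    """Spectral-subtraction style: subtract the predicted noise.

    Flooring at zero is an assumption here, to keep features non-negative.
    """
    return np.maximum(noisy - predicted_noise, 0.0)

def enhance_wiener(noisy, mask_logits):
    """Wiener style: apply a mask to the noisy features.

    Bounding the mask to (0, 1) with a sigmoid is an assumption.
    """
    mask = 1.0 / (1.0 + np.exp(-mask_logits))
    return mask * noisy

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noisy = rng.random((4, 8))           # hypothetical 4 channels, 8 frames
    kernel = np.array([0.25, 0.5, 0.25])
    smoothed = conv1d_time(noisy, kernel, causal=True)
    print(enhance_ss(noisy, 0.1 * smoothed).shape)
```

With `causal=True`, modifying later frames of the input leaves earlier output frames unchanged, which is the property that makes frame-by-frame real-time processing possible.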
