Speaker-Aware Deep Denoising Autoencoder with Embedded Speaker Identity for Speech Enhancement

Previous studies indicate that noise and speaker variations can degrade the performance of deep-learning–based speech– enhancement systems. To increase the system performance over environmental variations, we propose a novel speaker-aware system that integrates a deep denoising autoencoder (DDAE) with an embedded speaker identity. The overall system first extracts embedded speaker identity features using a neural network model; then the DDAE takes the augmented features as input to generate enhanced spectra. With the additional embedded features, the speech-enhancement system can be guided to generate the optimal output corresponding to the speaker identity. We tested the proposed speech-enhancement system on the TIMIT dataset. Experimental results showed that the proposed speech-enhancement system could improve the sound quality and intelligibility of speech signals from additive noisecorrupted utterances. In addition, the results suggested system robustness for unseen speakers when combined with speaker features.

[1]  Zhong-Qiu Wang,et al.  A Joint Training Framework for Robust Automatic Speech Recognition , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[2]  Erik McDermott,et al.  Deep neural networks for small footprint text-dependent speaker verification , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Andries P. Hekstra,et al.  Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[4]  G. K.,et al.  Learning Spectral Mapping for Speech Dereverberation and Denoising , 2017 .

[5]  Florin Curelaru,et al.  Front-End Factor Analysis For Speaker Verification , 2018, 2018 International Conference on Communications (COMM).

[6]  Jacob Benesty,et al.  Speech Enhancement (Signals and Communication Technology) , 2005 .

[7]  Peter Vary,et al.  Speech Enhancement by MAP Spectral Amplitude Estimation Using a Super-Gaussian Speech Model , 2005, EURASIP J. Adv. Signal Process..

[8]  David Malah,et al.  Tracking speech-presence uncertainty to improve speech enhancement in non-stationary noise environments , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[9]  Jun Du,et al.  An Experimental Study on Speech Enhancement Based on Deep Neural Networks , 2014, IEEE Signal Processing Letters.

[10]  Jacob Benesty,et al.  Fundamentals of Noise Reduction , 2008 .

[11]  Yu Tsao,et al.  Voice Conversion from Unaligned Corpora Using Variational Autoencoding Wasserstein Generative Adversarial Networks , 2017, INTERSPEECH.

[12]  Andrew J. R. Simpson Probabilistic Binary-Mask Cocktail-Party Source Separation in a Convolutional Deep Neural Network , 2015, ArXiv.

[13]  Yu Tsao,et al.  A Deep Denoising Autoencoder Approach to Improving the Intelligibility of Vocoded Speech in Cochlear Implant Simulation , 2017, IEEE Transactions on Biomedical Engineering.

[14]  Yannan Yannan Wang,et al.  A Gender Mixture Detection Approach to Unsupervised Single-Channel Speech Separation Based on Deep Neural Networks , 2017, TASLP.

[15]  Philipos C. Loizou,et al.  Speech enhancement based on perceptually motivated bayesian estimators of the magnitude spectrum , 2005, IEEE Transactions on Speech and Audio Processing.

[16]  Jesper Jensen,et al.  An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[17]  Yu Tsao,et al.  Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks , 2017, IEEE Transactions on Emerging Topics in Computational Intelligence.

[18]  Jun Du,et al.  Dynamic noise aware training for speech enhancement based on deep neural networks , 2014, INTERSPEECH.

[19]  Jun Du,et al.  A unified DNN approach to speaker-dependent simultaneous speech enhancement and speech separation in low SNR environments , 2017, Speech Commun..

[20]  DeLiang Wang,et al.  Supervised Speech Separation Based on Deep Learning: An Overview , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[21]  Changchun Bao,et al.  Wiener filtering based speech enhancement with Weighted Denoising Auto-encoder and noise classification , 2014, Speech Commun..

[22]  DeLiang Wang,et al.  A Tandem Algorithm for Pitch Estimation and Voiced Speech Segregation , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[23]  Kuldip K. Paliwal,et al.  Single-channel speech enhancement using spectral subtraction in the short-time modulation domain , 2010, Speech Commun..

[24]  Jun Du,et al.  A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech , 2018, J. Signal Process. Syst..

[25]  Jun Du,et al.  A unified speaker-dependent speech separation and enhancement system based on deep neural networks , 2015, 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP).

[26]  Marc Moonen,et al.  Reduced-Bandwidth and Distributed MWF-Based Noise Reduction Algorithms for Binaural Hearing Aids , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[27]  Rahim Saeidi,et al.  Target speaker separation in a multisource environment using speaker-dependent postfilter and noise estimation , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[28]  Yu Tsao,et al.  Speech enhancement based on deep denoising autoencoder , 2013, INTERSPEECH.

[29]  John R. Hershey,et al.  VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking , 2018, INTERSPEECH.

[30]  Yu Tsao,et al.  Discriminative autoencoders for speaker verification , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31]  Hugo Van hamme,et al.  Exemplar-based speech enhancement for deep neural network based automatic speech recognition , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32]  Yu Tsao,et al.  SNR-Aware Convolutional Neural Network Modeling for Speech Enhancement , 2016, INTERSPEECH.

[33]  Jun Du,et al.  Multiple-target deep learning for LSTM-RNN based speech enhancement , 2017, 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA).

[34]  Tao Zhang,et al.  A novel target speaker dependent postfiltering approach for multichannel speech enhancement , 2017, 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[35]  Pejman Mowlaee Begzade Mahale,et al.  Speaker dependent speech enhancement using sinusoidal model , 2014, 2014 14th International Workshop on Acoustic Signal Enhancement (IWAENC).