Fusion of Amplitude and Complex Domains based on Deep Neural Networks for Speech Enhancement

Most recent work on speech enhancement estimates the amplitude of the clean spectrum and directly attaches the noisy phase to the estimate without any further processing. Today, most phase-aware speech processing systems operate on the real and imaginary parts of the speech spectrum rather than on the raw phase. In this paper, we propose a novel approach that fuses two deep methods for speech enhancement in the complex domain. The method combines the outputs of two deep neural networks (DNNs), one estimating the complex ideal ratio mask (cIRM) and the other the amplitude of clean speech, through a new logarithmic decision rule. This fusion rule, derived from psychoacoustic findings and spectrogram observations, yields a complementary structure and can therefore exploit the advantages of both the amplitude estimator and the complex-mask estimator in each time-frequency region. Evaluated on the TIMIT corpus, the proposed method achieves higher perceptual evaluation of speech quality (PESQ) scores than competing approaches, especially under unseen noise conditions, indicating the better generalization of the proposed architecture.

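To make the fusion concrete: the abstract states that a cIRM-estimating DNN and an amplitude-estimating DNN are combined per time-frequency region by a logarithmic decision rule, but the rule itself is not reproduced here. The NumPy sketch below is therefore only illustrative; the function fuse_estimates, the per-bin log-energy switch, and the threshold value are assumptions of this sketch, not the authors' published rule.

import numpy as np

def fuse_estimates(noisy_stft, amp_est, cirm_est, log_threshold=-3.0):
    """Illustrative fusion of an amplitude estimate and a cIRM estimate.

    noisy_stft    : complex STFT of the noisy speech (freq x time)
    amp_est       : clean magnitude estimated by the amplitude DNN
    cirm_est      : complex ideal ratio mask estimated by the complex DNN
    log_threshold : hypothetical switching level in log-energy units
    """
    # Amplitude-domain reconstruction: estimated clean magnitude combined
    # with the noisy phase, the conventional pipeline the abstract describes.
    amp_recon = amp_est * np.exp(1j * np.angle(noisy_stft))
    # Complex-domain reconstruction: the cIRM is applied multiplicatively
    # to the noisy complex spectrum, enhancing magnitude and phase jointly.
    cirm_recon = cirm_est * noisy_stft
    # Hypothetical logarithmic decision: keep the complex estimate in
    # high-energy time-frequency bins and the amplitude estimate elsewhere,
    # so each estimator contributes where it is assumed to be stronger.
    log_energy = np.log10(np.abs(noisy_stft) ** 2 + 1e-12)
    use_complex = log_energy > log_threshold
    return np.where(use_complex, cirm_recon, amp_recon)

A fused clean-speech waveform would then be obtained by an inverse STFT of the returned spectrum; the complementary structure arises from letting each estimator dominate in different time-frequency regions.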