Differentiable Consistency Constraints for Improved Deep Speech Enhancement

In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system’s output is not necessarily the STFT of a time-domain signal, and without mixture consistency, the sum of the estimated sources does not necessarily equal the input mixture. Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks.
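To make the mixture consistency constraint concrete, the following is a minimal sketch rather than the method of this work. It assumes one simple way to enforce the constraint: the residual between the mixture and the sum of the estimated sources is split equally across the sources in every time-frequency bin. The function name `project_mixture_consistent`, the equal weighting, and the flat-list representation of STFT bins are all illustrative assumptions; a practical system would apply the same arithmetic to complex tensors inside the network so that gradients can flow through it.

```python
def project_mixture_consistent(estimates, mixture):
    """Return source estimates adjusted so they sum to the mixture.

    estimates: list of sources, each a flat list of complex STFT bins.
    mixture:   flat list of complex STFT bins of the input mixture.
    Assumption: the residual is shared equally among the sources.
    """
    n_sources = len(estimates)
    projected = [list(source) for source in estimates]
    for k, mix_bin in enumerate(mixture):
        # Residual: what the estimates fail to explain in this bin.
        residual = mix_bin - sum(source[k] for source in estimates)
        for source in projected:
            source[k] += residual / n_sources
    return projected


# Toy example: two estimated sources (e.g. speech and noise), two bins.
estimates = [[1 + 1j, 2 + 0j], [0.5 - 1j, 1 + 1j]]
mixture = [2 + 0j, 2 + 2j]
projected = project_mixture_consistent(estimates, mixture)

for k, mix_bin in enumerate(mixture):
    total = sum(source[k] for source in projected)
    print(f"bin {k}: sum of sources = {total}, mixture = {mix_bin}")
```

Because the operation is only additions and a division by a constant, it is differentiable and applies to complex-valued estimates just as it does to real-valued ones. STFT consistency is a separate property: a common check, again stated here as an assumption rather than as this work's procedure, is to apply an inverse STFT followed by a forward STFT and compare the result with the original estimate.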
