Phase Recovery with Bregman Divergences for Audio Source Separation

Time-frequency audio source separation is usually achieved by estimating the short-time Fourier transform (STFT) magnitude of each source and then applying a phase recovery algorithm to retrieve time-domain signals. In particular, the multiple input spectrogram inversion (MISI) algorithm has shown good performance in several recent works. This algorithm minimizes a quadratic reconstruction error between the target and estimated magnitude spectrograms. However, this loss does not properly account for some perceptual properties of audio, and alternative discrepancy measures such as beta-divergences have been preferred in many settings. In this paper, we propose to reformulate phase recovery in audio source separation as a minimization problem involving Bregman divergences. To optimize the resulting objective, we derive a projected gradient descent algorithm. Experiments conducted on a speech enhancement task show that this approach outperforms MISI for several alternative losses, which highlights the relevance of Bregman divergences for audio source separation applications.
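To make the approach concrete, the sketch below illustrates projected gradient descent for multi-source phase recovery under a Bregman divergence. It is a minimal sketch, not the paper's exact algorithm: the function name pgd_phase_recovery, the choice of the Kullback-Leibler divergence (the beta-divergence with beta = 1) as the loss, the fixed step size mu, and the use of scipy's least-squares iSTFT as an approximate adjoint of the STFT are all illustrative assumptions.

```python
# Minimal sketch: projected gradient descent for multi-source phase
# recovery with a Bregman divergence (here, KL). Assumptions: the target
# magnitudes V were computed with the same STFT parameters as below, and
# the iSTFT is used as an approximate adjoint of the STFT.
import numpy as np
from scipy.signal import stft, istft

def pgd_phase_recovery(x, V, n_fft=1024, hop=256, mu=1e-3, n_iter=100, eps=1e-8):
    """x: mixture signal, shape (T,). V: target magnitudes, shape (J, F, N),
    where F = n_fft // 2 + 1 and N is the number of STFT frames of x."""
    J = V.shape[0]
    # Initialize each source estimate as an equal share of the mixture.
    s = np.tile(x / J, (J, 1))
    for _ in range(n_iter):
        for j in range(J):
            _, _, C = stft(s[j], nperseg=n_fft, noverlap=n_fft - hop)
            mag = np.abs(C) + eps
            # KL divergence d(v, w) = v*log(v/w) - v + w has derivative
            # (1 - v/w) in w; the chain rule through w = |C| gives the
            # complex-valued gradient direction (1 - V/|C|) * C/|C|.
            G = (1.0 - V[j] / mag) * (C / mag)
            # Map the spectrogram-domain gradient back to the time domain
            # (least-squares iSTFT as an approximate adjoint).
            _, g = istft(G, nperseg=n_fft, noverlap=n_fft - hop)
            g = g[: s.shape[1]]
            s[j, : g.shape[0]] -= mu * g
        # Projection step: enforce the mixing constraint sum_j s_j = x by
        # distributing the residual equally across sources, as in MISI.
        r = (x - s.sum(axis=0)) / J
        s = s + r[np.newaxis, :]
    return s
```

Swapping the gradient expression for that of another beta-divergence changes the loss while keeping the rest of the iteration intact, and the final projection step plays the same role as the mixing-error redistribution step in MISI.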
