Loss Functions for Deep Monaural Speech Enhancement

Deep neural networks have proven highly effective at speech enhancement, which makes them attractive not just as front-ends for machine listening and speech recognition, but also as enhancement models for the benefit of human listeners. They are, however, usually trained with loss functions that assess quality only in terms of a minimum mean squared error. This neglects the facts that human audio perception is far better described by logarithmic measures than by linear ones, that psychoacoustic hearing thresholds limit the perceptibility of many signal components in a mixture, and that a degree of continuity may also be expected of signals, so that sudden changes in the gain of a system can be detrimental. In the following, we cast these properties of human perception into a form that can aid the optimization of a deep neural network speech enhancement system. We explore their effects on a range of model topologies, demonstrating the efficacy of the proposed modifications.
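To make the direction of these modifications concrete, the sketch below illustrates two of the three perceptual properties as differentiable loss terms in PyTorch: a log-domain spectral error, reflecting the roughly logarithmic character of loudness perception, and a continuity penalty on the implied enhancement gain, discouraging sudden gain changes. This is a minimal illustration under assumed conventions, not the paper's exact formulation: the function name `perceptual_loss`, the tensor shapes, the blending weight `alpha`, and the precise form of each term are assumptions, and a psychoacoustic-threshold term is omitted for brevity.

```python
import torch

def perceptual_loss(est_mag, ref_mag, noisy_mag, eps=1e-8, alpha=0.1):
    """Hypothetical perceptually motivated loss (illustrative only).

    est_mag, ref_mag, noisy_mag: magnitude spectrograms of shape
    (batch, freq, time) for the enhanced, clean-reference, and
    noisy-input signals, respectively.
    """
    # Log-domain spectral MSE: compare magnitudes on a logarithmic
    # scale, which tracks human loudness perception more closely
    # than a linear-domain MSE.
    loss_log = ((torch.log(est_mag + eps)
                 - torch.log(ref_mag + eps)) ** 2).mean()

    # Temporal-continuity penalty: the ratio of enhanced to noisy
    # magnitude is the gain the system applies per time-frequency
    # bin; penalizing frame-to-frame jumps in this gain discourages
    # perceptually detrimental sudden changes.
    gain = est_mag / (noisy_mag + eps)
    loss_cont = ((gain[..., 1:] - gain[..., :-1]) ** 2).mean()

    return loss_log + alpha * loss_cont
```

In training, such a composite criterion would simply replace the plain MSE loss, leaving the enhancement model's topology unchanged.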
