A Flow-Based Neural Network for Time Domain Speech Enhancement

Speech enhancement involves separating a target speech signal from an intrusive background. Although generative approaches using Variational Autoencoders (VAEs) or Generative Adversarial Networks (GANs) have increasingly been used in recent years, normalizing flow (NF) based systems are still scarce, despite their success in related fields. Thus, in this paper we propose an NF framework that directly models the enhancement process by density estimation of clean speech utterances conditioned on their noisy counterparts. The WaveGlow model from speech synthesis is adapted to enable direct enhancement of noisy utterances in the time domain. In addition, we demonstrate that nonlinear input companding benefits model performance by equalizing the distribution of input samples. Experimental evaluation on a publicly available dataset shows results comparable to current state-of-the-art GAN-based approaches, while surpassing the chosen baselines on objective evaluation metrics.
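To illustrate what nonlinear input companding does to time-domain samples, the following minimal sketch assumes mu-law companding with mu = 255, a common choice in speech coding; the abstract itself does not name the companding function. The function names and the synthetic Laplacian test signal are hypothetical and serve only as a stand-in for speech samples, which tend to cluster near zero.

```python
import numpy as np


def mu_law_compress(x, mu=255.0):
    """Compress samples in [-1, 1] with mu-law (assumed companding function)."""
    x = np.clip(x, -1.0, 1.0)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)


def mu_law_expand(y, mu=255.0):
    """Invert mu_law_compress, mapping companded samples back to [-1, 1]."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu


# Synthetic stand-in for speech: heavy-tailed samples concentrated near zero.
rng = np.random.default_rng(0)
x = np.clip(rng.laplace(scale=0.05, size=16000), -1.0, 1.0)

y = mu_law_compress(x)      # companded signal, spread more evenly over [-1, 1]
x_rec = mu_law_expand(y)    # expansion recovers the original samples

print("std before companding:", np.std(x))
print("std after companding: ", np.std(y))
print("max reconstruction error:", np.max(np.abs(x - x_rec)))
```

Because the mapping is invertible, a model could in principle operate on the companded signal and have its output expanded back to the original amplitude scale afterwards.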
