On the Design of Deep Priors for Unsupervised Audio Restoration

Unsupervised deep learning methods for solving audio restoration problems rely extensively on carefully tailored neural architectures that carry strong inductive biases for defining priors in the time or spectral domain. In this context, much recent success has been achieved with sophisticated convolutional network constructions that recover audio signals in the spectral domain. In practice, however, such audio priors require careful engineering of the convolutional kernels to be effective at solving ill-posed restoration tasks, while also remaining easy to train. To this end, in this paper we propose a new U-Net based prior that does not increase the network complexity or alter the convergence behavior of existing convolutional architectures, yet leads to significantly improved restoration. In particular, we advocate the use of carefully designed dilation schedules and dense connections in the U-Net architecture to obtain powerful audio priors. Through empirical studies on standard benchmarks and a variety of ill-posed restoration tasks, such as audio denoising, inpainting, and source separation, we demonstrate that our proposed approach consistently outperforms widely adopted audio prior architectures.

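To make the architectural idea concrete, below is a minimal PyTorch sketch, not the authors' released code, of a U-Net whose convolutional blocks combine an increasing dilation schedule with dense (concatenative) connections. All specifics are illustrative assumptions: the dilation schedule (1, 2, 4), the growth rate, the two-level depth, and the use of 2-D convolutions over spectrograms are not taken from the paper.

```python
# A hedged sketch of a dilated, densely connected U-Net audio prior.
# Layer sizes, the dilation schedule, and the 2-D spectrogram input
# are assumptions for illustration only.
import torch
import torch.nn as nn


class DenseDilatedBlock(nn.Module):
    """Stack of dilated convolutions where each layer sees the
    concatenation of the block input and all previous layer outputs."""

    def __init__(self, in_ch, growth, dilations=(1, 2, 4)):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for d in dilations:
            self.layers.append(nn.Sequential(
                # padding=d with kernel 3 preserves spatial size at dilation d
                nn.Conv2d(ch, growth, kernel_size=3, padding=d, dilation=d),
                nn.BatchNorm2d(growth),
                nn.LeakyReLU(0.2),
            ))
            ch += growth  # dense connectivity grows the channel count
        self.out_ch = ch

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)


class DilatedDenseUNet(nn.Module):
    """Two-level U-Net built from the dense dilated blocks above."""

    def __init__(self, in_ch=1, growth=16):
        super().__init__()
        self.enc1 = DenseDilatedBlock(in_ch, growth)
        self.down = nn.Conv2d(self.enc1.out_ch, self.enc1.out_ch,
                              kernel_size=3, stride=2, padding=1)
        self.enc2 = DenseDilatedBlock(self.enc1.out_ch, growth)
        self.up = nn.ConvTranspose2d(self.enc2.out_ch, self.enc1.out_ch,
                                     kernel_size=4, stride=2, padding=1)
        self.dec = DenseDilatedBlock(2 * self.enc1.out_ch, growth)
        self.head = nn.Conv2d(self.dec.out_ch, in_ch, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.down(e1))
        u = self.up(e2)  # U-Net skip: concatenate upsampled and encoder features
        return self.head(self.dec(torch.cat([u, e1], dim=1)))
```

Used as a deep prior in the Deep Image Prior sense, the network is fit from a fixed random input to a single degraded observation, and an early-stopped output serves as the restored estimate:

```python
# Deep-prior-style fitting loop (shapes and step count are placeholders).
net = DilatedDenseUNet()
z = torch.randn(1, 1, 64, 128)       # fixed random input, never updated
noisy = torch.randn(1, 1, 64, 128)   # stand-in for a degraded spectrogram
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(10):                  # in practice: thousands of steps, early-stopped
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(net(z), noisy)
    loss.backward()
    opt.step()
```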