Perceptual Loss Based Speech Denoising with an Ensemble of Audio Pattern Recognition and Self-Supervised Models

Deep learning based speech denoising still suffers from the challenge of improving the perceptual quality of enhanced signals. We introduce a generalized framework called Perceptual Ensemble Regularization Loss (PERL) built on the idea of perceptual losses. A perceptual loss discourages distortion to certain speech properties, and we analyze it using six large-scale pre-trained models: acoustic event classification, an acoustic model for speech recognition, speaker embedding, emotion classification, and two self-supervised speech encoders (PASE+, wav2vec 2.0). We first build a strong baseline (without PERL) using Conformer Transformer Networks on the popular enhancement benchmark VCTK-DEMAND. Using the auxiliary models one at a time, we find the acoustic event model and the self-supervised model PASE+ to be the most effective. Our best model (PERL-AE) uses only the acoustic event model (trained on AudioSet) and outperforms state-of-the-art methods on major perceptual metrics. To explore whether denoising can leverage the full framework, we use all six networks, but find that the resulting seven-loss formulation suffers from the challenges of Multi-Task Learning. Finally, we report the critical observation that state-of-the-art Multi-Task weight learning methods cannot outperform hand tuning, perhaps due to domain mismatch and weak complementarity of the losses.
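To make the loss construction concrete, here is a minimal PyTorch-style sketch of an ensemble perceptual loss of this kind: a weighted sum of a waveform reconstruction term and feature-space distances computed by frozen pre-trained auxiliary networks. The class name `PERLoss`, the use of L1 distances, and the hand-tuned weights are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PERLoss(nn.Module):
    """Sketch of an ensemble perceptual regularization loss.

    total = w_recon * L1(enhanced, clean)
          + sum_i w_i * L1(f_i(enhanced), f_i(clean))
    where each f_i is a frozen pre-trained network (e.g., acoustic
    event classifier, acoustic model, speaker embedding, emotion
    classifier, PASE+, wav2vec 2.0) used as a feature extractor.
    """

    def __init__(self, aux_models, aux_weights, recon_weight=1.0):
        super().__init__()
        assert len(aux_models) == len(aux_weights)
        self.aux_models = nn.ModuleList(aux_models)
        for m in self.aux_models:
            m.eval()                         # auxiliary nets stay frozen
            for p in m.parameters():
                p.requires_grad = False
        self.aux_weights = aux_weights       # hand-tuned loss weights
        self.recon_weight = recon_weight
        self.l1 = nn.L1Loss()

    def forward(self, enhanced, clean):
        # Waveform-domain reconstruction term.
        loss = self.recon_weight * self.l1(enhanced, clean)
        # One perceptual term per auxiliary network.
        for w, model in zip(self.aux_weights, self.aux_models):
            feat_enh = model(enhanced)       # gradients flow to the denoiser
            with torch.no_grad():
                feat_ref = model(clean)      # fixed target features
            loss = loss + w * self.l1(feat_enh, feat_ref)
        return loss
```

With a single auxiliary network (the acoustic event model) this reduces to the PERL-AE configuration; with all six it becomes the seven-loss formulation whose weights, as the abstract notes, proved hard to learn automatically.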
