Improving Generalizability of Distilled Self-Supervised Speech Processing Models Under Distorted Settings

Self-supervised learned (SSL) speech pre-trained models perform well across various speech processing tasks. Distilled versions of SSL models have been developed to match the needs of on-device speech applications. Though having similar performance as original SSL models, distilled counterparts suffer from performance degradation even more than their original versions in distorted environments. This paper proposes to apply Cross-Distortion Mapping and Domain Adversarial Training to SSL models during knowledge distillation to alleviate the performance gap caused by the domain mismatch problem. Results show consistent performance improvements under both in- and out-of-domain distorted setups for different downstream tasks while keeping efficient model size.

[1]  Karl El Hajal,et al.  BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping , 2022, HEAR@NeurIPS.

[2]  Lirong Dai,et al.  Joint Training of Speech Enhancement and Self-supervised Model for Noise-robust ASR , 2022, ArXiv.

[3]  Sergiy Matusevych,et al.  Icassp 2022 Deep Noise Suppression Challenge , 2022, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Tara N. Sainath,et al.  Self-Supervised Speech Representation Learning: A Review , 2022, IEEE Journal of Selected Topics in Signal Processing.

[5]  Hung-yi Lee,et al.  Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation , 2022, Interspeech.

[6]  Michael Auli,et al.  data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language , 2022, ICML.

[7]  Li-Rong Dai,et al.  A Noise-Robust Self-Supervised Pre-Training Model Based Speech Representation Learning for Automatic Speech Recognition , 2022, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  DeLiang Wang,et al.  Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Jinyu Li,et al.  Wav2vec-Switch: Contrastive Learning from Original-Noisy Speech Pairs for Robust Speech Recognition , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Hung-yi Lee,et al.  Distilhubert: Speech Representation Learning by Layer-Wise Distillation of Hidden-Unit Bert , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  X. Serra,et al.  FSD50K: An Open Dataset of Human-Labeled Sound Events , 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[12]  Ruslan Salakhutdinov,et al.  HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units , 2021, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[13]  Andy T. Liu,et al.  SUPERB: Speech processing Universal PERformance Benchmark , 2021, Interspeech.

[14]  Gabriel Synnaeve,et al.  Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training , 2021, Interspeech.

[15]  K. Kashino,et al.  BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation , 2021, IEEE International Joint Conference on Neural Network.

[16]  Duen Horng Chau,et al.  Best of Both Worlds: Robust Accented Speech Recognition with Adversarial Transfer Learning , 2021, Interspeech.

[17]  Andreas Stolcke,et al.  REDAT: Accent-Invariant Representation for End-To-End ASR by Domain Adversarial Training with Relabeling , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[18]  Xinlei Chen,et al.  Exploring Simple Siamese Representation Learning , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Matthijs Douze,et al.  Data Augmenting Contrastive Learning of Speech Representations in the Time Domain , 2020, 2021 IEEE Spoken Language Technology Workshop (SLT).

[20]  Joon Son Chung,et al.  Augmentation adversarial training for unsupervised speaker recognition , 2020, ArXiv.

[21]  Abdel-rahman Mohamed,et al.  wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations , 2020, NeurIPS.

[22]  Pierre H. Richemond,et al.  Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning , 2020, NeurIPS.

[23]  Doug Downey,et al.  Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks , 2020, ACL.

[24]  Jonathan Le Roux,et al.  WHAM!: Extending Speech Separation to Noisy Environments , 2019, INTERSPEECH.

[25]  Myle Ott,et al.  fairseq: A Fast, Extensible Toolkit for Sequence Modeling , 2019, NAACL.

[26]  Yu Tsao,et al.  Noise Adaptive Speech Enhancement using Domain Adversarial Training , 2018, INTERSPEECH.

[27]  François Laviolette,et al.  Domain-Adversarial Training of Neural Networks , 2015, J. Mach. Learn. Res..

[28]  Jon Barker,et al.  The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines , 2018, INTERSPEECH.

[29]  Daniel Povey,et al.  MUSAN: A Music, Speech, and Noise Corpus , 2015, ArXiv.

[30]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[32]  Jont B. Allen,et al.  Image method for efficiently simulating small‐room acoustics , 1976 .