Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering

Self-supervised speech representation models have succeeded in various tasks, but improving them for content-related problems using unlabeled data is challenging. We propose speaker-invariant clustering (Spin), a novel self-supervised learning method that clusters speech representations and performs swapped prediction between the original and speaker-perturbed utterances. Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU. Spin improves pre-trained networks and outperforms prior methods in speech recognition and acoustic unit discovery.
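The swapped-prediction idea described above can be sketched numerically: encode the original and speaker-perturbed utterances, assign each frame to a cluster codebook, and train each view to predict the other view's cluster assignments, so the assignments must be invariant to the speaker perturbation. The sketch below is a minimal illustration under assumed details (cosine-similarity logits, Sinkhorn-normalized pseudo-targets, codebook size, temperature); the actual Spin implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Soft, balanced cluster assignments via Sinkhorn-Knopp normalization.
    rows = frames, cols = clusters; each returned row sums to 1."""
    Q = np.exp(scores / eps)
    Q /= Q.sum()
    T, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True); Q /= K   # equalize cluster mass
        Q /= Q.sum(axis=1, keepdims=True); Q /= T   # equalize frame mass
    return Q * T

def swapped_prediction_loss(z_orig, z_pert, codebook, temp=0.1):
    """Cross-entropy between each view's softmax over the codebook and the
    OTHER view's Sinkhorn pseudo-targets (the 'swapped' prediction)."""
    def logits(z):
        zn = z / np.linalg.norm(z, axis=1, keepdims=True)
        cn = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
        return zn @ cn.T                            # cosine similarity to codes
    s1, s2 = logits(z_orig), logits(z_pert)
    q1, q2 = sinkhorn(s1), sinkhorn(s2)             # pseudo-targets (no grad in practice)
    def xent(q, s):
        p = np.exp(s / temp)
        p /= p.sum(axis=1, keepdims=True)
        return -(q * np.log(p + 1e-12)).sum(axis=1).mean()
    return 0.5 * (xent(q2, s1) + xent(q1, s2))      # swap targets across views

# Toy example: 8 frames of 16-dim features, a 4-entry codebook, and a
# "perturbed" view standing in for the speaker-perturbed utterance.
z = rng.standard_normal((8, 16))
z_pert = z + 0.01 * rng.standard_normal((8, 16))
codebook = rng.standard_normal((4, 16))
loss = swapped_prediction_loss(z, z_pert, codebook)
```

Minimizing this loss pushes the two views toward agreeing cluster assignments while the Sinkhorn balancing prevents all frames from collapsing into a single cluster; because only a small clustering head and codebook are trained on top of a frozen or lightly fine-tuned encoder, the procedure is cheap, consistent with the 45-minute single-GPU budget reported above.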
