Continual Self-Supervised Domain Adaptation for End-to-End Speaker Diarization

In conventional domain adaptation for speaker diarization, a large collection of annotated conversations from the target domain is required. In this work, we propose a novel continual training scheme for domain adaptation of an end-to-end speaker diarization system, which processes one conversation at a time and benefits from full self-supervision thanks to pseudo-labels. These properties enable autonomous adaptation (e.g., of a voice assistant to a new household) while avoiding permanent storage of possibly sensitive user conversations. We experiment extensively on the 11 domains of the DIHARD III corpus and show the effectiveness of our approach with respect to a pre-trained baseline, achieving a 17% relative performance improvement. We also find that data augmentation and a well-defined target domain are key factors for avoiding divergence and benefiting from transfer.
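To make the training scheme concrete, below is a minimal PyTorch sketch of a per-conversation adaptation step of the kind the abstract describes: the current model's binarized outputs serve as pseudo-labels for the incoming conversation, and the model is then fine-tuned on augmented views of that same audio before the recording is discarded. All names (adapt_to_conversation, model, augment) and hyperparameter values are illustrative assumptions, not the paper's implementation; the model is assumed to output per-frame, per-speaker probabilities in [0, 1].

import torch

def adapt_to_conversation(model, waveform, augment,
                          threshold=0.5, lr=1e-4, steps=10):
    """One continual adaptation step on a single unlabeled conversation.

    `model` is assumed to map a waveform tensor to per-frame, per-speaker
    activity probabilities in [0, 1]; `augment` is any waveform-level data
    augmentation (e.g. additive noise or reverberation). Both are
    hypothetical stand-ins, not the paper's actual components.
    """
    # Self-supervision: binarize the current model's own outputs and use
    # them as fixed pseudo-labels for this conversation.
    model.eval()
    with torch.no_grad():
        pseudo_labels = (model(waveform) > threshold).float()

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.BCELoss()

    # Fine-tune on augmented views of the same audio; perturbing the input
    # while keeping the pseudo-labels fixed is one plausible reading of why
    # augmentation helps avoid divergence.
    model.train()
    for _ in range(steps):
        scores = model(augment(waveform))
        loss = criterion(scores, pseudo_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # The raw audio can now be discarded; only the updated weights persist,
    # so no user conversation needs to be stored permanently.
    return model

In this sketch, processing a stream of conversations is just a loop that calls adapt_to_conversation once per recording, which mirrors the one-conversation-at-a-time setting described above.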
