Multi-Task Neural Network for Robust Multiple Speaker Embedding Extraction

This paper introduces a novel approach for extracting speaker embeddings from audio mixtures of multiple overlapping voices, based on a multi-task neural network. The network first extracts a latent feature for each direction; this feature is used both to detect sound sources and to identify speakers. In contrast to traditional approaches, the proposed method does not rely on explicit sound source separation. Instead, the network learns from data to extract the most suitable features of the sounds arriving from different directions. Experiments on audio recordings of overlapping sound sources show that the proposed approach outperforms a traditional beamforming-based method.
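The data flow described above (one shared latent feature per direction, consumed by a detection task and a speaker identification task) can be illustrated with a minimal sketch. This is not the paper's architecture: the layer sizes, the single dense trunk, the random untrained weights, and all names (`init_params`, `forward`, `select_speakers`, the 0.5 threshold) are assumptions made purely for illustration.

```python
import numpy as np

def init_params(rng, in_dim, n_dirs, latent_dim, emb_dim):
    """Random (untrained) weights for a hypothetical shared trunk and two heads."""
    return {
        "W_trunk": rng.standard_normal((in_dim, n_dirs * latent_dim)) * 0.1,
        "w_det": rng.standard_normal(latent_dim) * 0.1,
        "W_emb": rng.standard_normal((latent_dim, emb_dim)) * 0.1,
    }

def forward(params, x, n_dirs):
    """Map an input feature vector to per-direction outputs.

    Returns a detection probability and a unit-norm speaker embedding
    for each of the n_dirs candidate directions.
    """
    # Shared trunk: one latent feature vector per direction.
    latent = np.tanh(x @ params["W_trunk"]).reshape(n_dirs, -1)
    # Task 1: is there a sound source in this direction? (sigmoid score)
    det_prob = 1.0 / (1.0 + np.exp(-(latent @ params["w_det"])))
    # Task 2: speaker embedding for this direction, L2-normalized.
    emb = latent @ params["W_emb"]
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8)
    return det_prob, emb

def select_speakers(det_prob, emb, threshold=0.5):
    """Keep embeddings only for directions whose detection score exceeds the threshold."""
    idx = np.flatnonzero(det_prob > threshold)
    return idx, emb[idx]

# Example with made-up sizes: 64 input features, 36 directions.
rng = np.random.default_rng(0)
n_dirs = 36
params = init_params(rng, in_dim=64, n_dirs=n_dirs, latent_dim=16, emb_dim=8)
x = rng.standard_normal(64)
det_prob, emb = forward(params, x, n_dirs)
active_dirs, active_emb = select_speakers(det_prob, emb)
```

Because both heads read the same per-direction latent feature, no separated waveform is ever produced; in a trained model the two task losses would shape that shared feature jointly.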
