MIRNet: Learning multiple identities representations in overlapped speech

Many approaches can derive information about a single speaker's identity from speech by learning to recognize consistent characteristics of acoustic parameters. However, it is challenging to determine identity information when multiple concurrent speakers are present in a given signal. In this paper, we propose a novel deep speaker representation strategy that can reliably extract multiple speaker identities from overlapped speech. We design a network that extracts, from a given mixture, a high-level embedding containing information about each speaker's identity. Unlike conventional approaches that need reference acoustic features for training, our proposed algorithm only requires the speaker identity labels of the overlapped speech segments. We demonstrate the effectiveness and usefulness of our algorithm in a speaker verification task and in a speech separation system conditioned on target speaker embeddings obtained through the proposed method.
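Training embedding heads from identity labels alone raises a label-assignment problem: the network does not know which of its outputs corresponds to which speaker in the mixture. A common way to resolve this is a permutation-invariant classification objective, where the speaker-to-head assignment with the lowest total loss is used. The sketch below illustrates that idea in plain numpy; it is a hypothetical simplification for exposition, not the paper's actual loss function, and the function names (`pit_identity_loss`, `cross_entropy`) are illustrative.

```python
import itertools
import numpy as np

def cross_entropy(logits, label):
    # Numerically stable softmax cross-entropy for one logit vector.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def pit_identity_loss(head_logits, speaker_labels):
    """Permutation-invariant identity loss (illustrative).

    head_logits    : list of K logit vectors, one per embedding head,
                     each over the set of training speaker IDs.
    speaker_labels : list of K ground-truth speaker IDs for the mixture
                     (in arbitrary order -- hence the permutation search).
    Returns the mean cross-entropy under the best label-to-head assignment.
    """
    K = len(speaker_labels)
    best = np.inf
    for perm in itertools.permutations(range(K)):
        total = sum(
            cross_entropy(head_logits[k], speaker_labels[perm[k]])
            for k in range(K)
        )
        best = min(best, total)
    return best / K
```

Because the assignment is re-solved per training example, the heads are free to specialize without a fixed output order; for mixtures of more than a few speakers, the factorial permutation search is typically replaced by a Hungarian-style assignment.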
