S-Vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder

X-vectors are among the most popular speaker embeddings; they are obtained from an architecture that gradually builds a larger temporal context across its layers. In this paper, we propose to derive speaker embeddings from a Transformer encoder trained for speaker classification. Self-attention, on which the Transformer encoder is built, attends to all the features over the entire utterance and may therefore be better suited to capturing the speaker characteristics in an utterance. We refer to the speaker embeddings obtained from the proposed speaker classification model as s-vectors, to emphasize that they come from an architecture that relies heavily on self-attention. Through experiments, we demonstrate that s-vectors perform better than x-vectors. In addition to s-vectors, we propose a new architecture for speaker verification, also based on the Transformer encoder, as a replacement for conventional verification based on probabilistic linear discriminant analysis (PLDA). This architecture is inspired by the next sentence prediction task of bidirectional encoder representations from Transformers (BERT): we feed it the s-vectors of two utterances to verify whether they belong to the same speaker. We name this architecture the Transformer encoder speaker authenticator (TESA). Our experiments show that s-vectors with TESA outperform s-vectors with conventional PLDA-based speaker verification.
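To make concrete the idea that self-attention attends to every frame of the utterance, the following is a minimal sketch and not the model from the paper. It assumes single-head scaled dot-product attention with identity query, key, and value projections, followed by mean pooling over frames. The function names (`self_attention`, `mean_pool`, `toy_embedding`) are hypothetical, and the pooling step is an assumption made for illustration; the abstract does not specify how s-vectors are extracted from the encoder.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the input frames themselves
    (identity projections), so no learned parameters are involved.
    Every output frame is a weighted average of ALL input frames.
    """
    d = len(frames[0])
    out = []
    for q in frames:
        # Similarity of this frame to every frame in the utterance.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)])
    return out


def mean_pool(frames):
    """Average frame-level vectors into one utterance-level vector."""
    n = len(frames)
    return [sum(f[j] for f in frames) / n for j in range(len(frames[0]))]


def toy_embedding(frames):
    """Hypothetical utterance embedding: attention, then mean pooling."""
    return mean_pool(self_attention(frames))


if __name__ == "__main__":
    utterance = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [0.0, 0.0]]
    print(toy_embedding(utterance))
```

Unlike a layer with a fixed temporal window, each output frame here depends on the whole utterance in a single step, which is the property the abstract contrasts with the gradually widening context of x-vectors.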
