Speaker Embedding Extraction with Phonetic Information

Speaker embeddings achieve promising results on many speaker verification tasks. Phonetic information, although an important component of speech, is rarely considered when extracting speaker embeddings. In this paper, we introduce phonetic information into speaker embedding extraction based on the x-vector architecture. We propose two methods, one using phonetic vectors and one using multi-task learning. On the Fisher dataset, our best system outperforms the original x-vector approach by 20% in EER and by 15% in both minDCF08 and minDCF10. Experiments conducted on NIST SRE10 further demonstrate the effectiveness of the proposed methods.
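For readers unfamiliar with the headline metric, the following is a minimal sketch of how an equal error rate (EER) can be computed from verification scores, and how a relative improvement between two systems can be expressed. It is not the evaluation code used in the paper: the function names `equal_error_rate` and `relative_reduction` and the toy scores are hypothetical, and the sketch assumes the reported 20% is a relative reduction, which the abstract does not state explicitly.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Hypothetical helper for illustration; higher scores are assumed to mean
    "more likely the same speaker".
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)

    # Candidate thresholds: every score seen in either trial list.
    thresholds = np.sort(np.concatenate([target, nontarget]))

    # False-reject rate: target trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False-accept rate: non-target trials scored at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest to each other.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)


def relative_reduction(baseline, proposed):
    """Relative reduction of an error metric (assumed reading of "by 20%")."""
    return (baseline - proposed) / baseline


# Toy scores, invented purely to exercise the functions.
toy_target = [0.9, 0.8, 0.7, 0.4]
toy_nontarget = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_target, toy_nontarget)
```

Under this reading, a baseline EER of 5% and a proposed-system EER of 4% would correspond to `relative_reduction(0.05, 0.04)`, that is, a 20% relative improvement; these two values are illustrative and do not come from the paper.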
