Deep neural network-based speaker embeddings for end-to-end speaker verification

In this study, we investigate an end-to-end text-independent speaker verification system. The architecture consists of a deep neural network that takes a variable-length speech segment and maps it to a speaker embedding. The objective function separates same-speaker and different-speaker pairs, and is reused during verification. Similar systems have recently shown promise for text-dependent verification, but we believe that such systems are unexplored for the text-independent task. We show that, given a large number of training speakers, the proposed system outperforms an i-vector baseline in equal error rate (EER) and at low miss rates. Relative to the baseline, the end-to-end system reduces EER by 13% on average and 29% pooled across test conditions. The fused system achieves a reduction of 32% on average and 38% pooled.
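
To make the reported metric concrete, the following is a minimal sketch of how an equal error rate and a relative EER reduction could be computed from verification scores. It is not the evaluation code used in the study: the function names `equal_error_rate` and `relative_reduction`, the simple threshold sweep, and all scores and baseline values in the usage example are assumptions chosen for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: scores of same-speaker trials (higher means more similar).
    nontarget_scores: scores of different-speaker trials.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # Miss: a same-speaker trial scored below the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: a different-speaker trial scored at or above it.
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - fa)
        if gap < best_gap:
            # The EER is the operating point where both error rates meet;
            # on finite data we take the closest point and average the two.
            best_gap, eer = gap, (miss + fa) / 2
    return eer


def relative_reduction(baseline_eer, system_eer):
    """Relative EER reduction of a system with respect to a baseline."""
    return (baseline_eer - system_eer) / baseline_eer


# Hypothetical toy scores, not taken from the study.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
toy_eer = equal_error_rate(targets, nontargets)

# Hypothetical EER values chosen so that the reduction is 13%.
toy_reduction = relative_reduction(0.10, 0.087)
```

Under this definition, a "13% reduction" means the system's EER is 13% lower than the baseline's in relative terms, not 13 percentage points lower.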