End-to-end text-independent speaker verification with flexibility in utterance duration

We continue our investigation of end-to-end text-independent speaker verification by incorporating the variability that arises from different utterance durations. Our previous study [1] showed that a triplet-loss-based end-to-end text-independent speaker verification system achieves competitive performance. To normalize duration variability, that system fed fixed-length inputs to the network through simple cropping or padding. These operations are not ideal, particularly for long utterances, where some information is discarded, whereas an i-vector system typically becomes more accurate as input duration increases. In this study, we propose replacing the final max/average pooling layer of the Inception-Resnet-v1 architecture with a Spatial Pyramid Pooling layer, which relaxes the fixed-length input constraint and lets us train the entire network end to end on inputs of arbitrary size. The modified network thus maps variable-length utterances to fixed-length embeddings. Experiments show that the new end-to-end system with variable-size input achieves a relative EER reduction of 8.4% over the end-to-end system with fixed-length input and 24.0% over the i-vector/PLDA baseline system.
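To illustrate how a Spatial Pyramid Pooling layer can turn a variable-length input into a fixed-length vector, the following is a minimal NumPy sketch. It is not the authors' implementation: the function name `spatial_pyramid_pool`, the pyramid levels `(1, 2, 4)`, the use of max pooling inside each bin, and the `(channels, height, width)` layout are all assumptions chosen for illustration. The idea is that each pyramid level splits the feature map into a fixed number of bins whose sizes scale with the input, so the number of pooled values depends only on the channel count and the levels.

```python
import math

import numpy as np


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (channels, height, width) map into a fixed-length vector.

    Hypothetical helper for illustration: each level n splits the map into
    an n x n grid of bins, and the maximum of every bin is kept per channel.
    Output length is channels * sum(n * n for n in levels).
    """
    channels, height, width = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin edges use floor for the start and ceil for the end, so
            # every bin is non-empty even when the map is smaller than n.
            r0 = (i * height) // n
            r1 = math.ceil((i + 1) * height / n)
            for j in range(n):
                c0 = (j * width) // n
                c1 = math.ceil((j + 1) * width / n)
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)


# Two feature maps that differ only in width (a stand-in for utterance
# duration) are mapped to vectors of the same length.
rng = np.random.default_rng(0)
short_map = rng.standard_normal((8, 10, 37))
long_map = rng.standard_normal((8, 10, 211))
short_vec = spatial_pyramid_pool(short_map)
long_vec = spatial_pyramid_pool(long_map)
print(short_vec.shape, long_vec.shape)
```

With 8 channels and levels `(1, 2, 4)`, both outputs have 8 × (1 + 4 + 16) = 168 values, regardless of the input width. A layer placed after this pooling step therefore always sees the same input size.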

[1] Douglas E. Sturim, et al. Support vector machines using GMM supervectors for speaker verification, 2006, IEEE Signal Processing Letters.

[2] Yu Qiao, et al. A Discriminative Feature Learning Approach for Deep Face Recognition, 2016, ECCV.

[3] John H. L. Hansen, et al. I-vector based physical task stress detection with different fusion strategies, 2015, INTERSPEECH.

[4] John H. L. Hansen, et al. UTD-CRSS Systems for 2018 NIST Speaker Recognition Evaluation, 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Douglas A. Reynolds, et al. Speaker Verification Using Adapted Gaussian Mixture Models, 2000, Digit. Signal Process.

[6] Thomas Fang Zheng, et al. Improving Short Utterance Speaker Recognition by Modeling Speech Unit Classes, 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[7] Chunlei Zhang, et al. End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances, 2017, INTERSPEECH.

[8] John H. L. Hansen, et al. An Investigation of Deep-Learning Frameworks for Speaker Verification Antispoofing, 2017, IEEE Journal of Selected Topics in Signal Processing.

[9] Sanjeev Khudanpur, et al. Deep neural network-based speaker embeddings for end-to-end speaker verification, 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[10] Jian Sun, et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11] Themos Stafylakis, et al. PLDA for speaker verification with utterances of arbitrary duration, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[12] David A. van Leeuwen, et al. Quality Measure Functions for Calibration of Speaker Recognition Systems in Various Duration Conditions, 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[13] Erik McDermott, et al. Deep neural networks for small footprint text-dependent speaker verification, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14] John H. L. Hansen, et al. Duration mismatch compensation for i-vector based speaker recognition systems, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[15] John H. L. Hansen, et al. UTD-CRSS system for the NIST 2015 language recognition i-vector machine learning challenge, 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16] Liu Gang, et al. Joint information from nonlinear and linear features for spoofing detection: An i-vector/DNN based approach, 2016.

[17] Patrick Kenny, et al. Joint Factor Analysis Versus Eigenchannels in Speaker Recognition, 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[18] Daniel Povey, et al. The Kaldi Speech Recognition Toolkit, 2011.

[19] Yifan Gong, et al. End-to-End attention based text-dependent speaker verification, 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[20] Georg Heigold, et al. End-to-end text-dependent speaker verification, 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21] Florin Curelaru, et al. Front-End Factor Analysis For Speaker Verification, 2018, 2018 International Conference on Communications (COMM).

[22] Hervé Bredin, et al. TristouNet: Triplet loss for speaker turn embedding, 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23] Yun Lei, et al. A novel scheme for speaker recognition using a phonetically-aware deep neural network, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24] Seyed Omid Sadjadi, et al. The IBM Speaker Recognition System: Recent Advances and Error Analysis, 2016, INTERSPEECH.

[25] James Philbin, et al. FaceNet: A unified embedding for face recognition and clustering, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).