DNN i-Vector Speaker Verification with Short, Text-Constrained Test Utterances

We investigate how to improve the performance of DNN i-vector based speaker verification for short, text-constrained test utterances, e.g., connected digit strings. Because its vocabulary is smaller and limited, text-constrained verification can deliver better performance than text-independent verification on short utterances. We study the ability of a "phonetically aware" Deep Neural Network (DNN) to perform the "stochastic phonetic alignment" used to construct supervectors and estimate the corresponding i-vectors, using two speech databases: a large-vocabulary, conversational, speaker-independent database (Fisher) and a small-vocabulary, continuous-digit database (RSR2015 Part III). We compare phonetic alignment efficiency and the resulting speaker verification performance across senone sets of different sizes, each of which can characterize the phonetic pronunciations of the utterances in the two databases. On the RSR2015 Part III evaluation, using only digit-related senones yields a relative improvement in equal error rate (EER) of 7.89% for male speakers and 3.54% for female speakers. We also study DNN bottleneck features to assess how well they extract phonetically sensitive information that is useful for text-independent or text-constrained speaker verification. We find that combining MFCCs with bottleneck features in a tandem configuration further reduces EERs.
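As an illustration of the "stochastic phonetic alignment" step, the sketch below accumulates the zeroth- and first-order statistics that feed supervector and i-vector estimation, using frame-level senone posteriors as soft alignments. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `accumulate_stats` and `tandem`, the `keep` argument, and the renormalization of posteriors over a retained senone subset are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Frame = Sequence[float]


def accumulate_stats(
    features: Sequence[Frame],
    posteriors: Sequence[Frame],
    keep: Optional[Sequence[int]] = None,
) -> Tuple[List[float], List[List[float]]]:
    """Accumulate zeroth-order (N) and first-order (F) statistics.

    features[t]   : acoustic feature vector of frame t (e.g. MFCC).
    posteriors[t] : DNN senone posteriors of frame t (soft alignment).
    keep          : hypothetical option, indices of senones to retain
                    (e.g. digit-related ones); posteriors are renormalized
                    over this subset, which is an assumption of this sketch.
    """
    if len(features) != len(posteriors):
        raise ValueError("features and posteriors must have the same length")
    if not features:
        raise ValueError("at least one frame is required")

    n_senones = len(posteriors[0])
    kept = list(range(n_senones)) if keep is None else list(keep)
    dim = len(features[0])

    N = [0.0] * len(kept)
    F = [[0.0] * dim for _ in kept]

    for x, post in zip(features, posteriors):
        sub = [post[k] for k in kept]
        total = sum(sub)
        if total <= 0.0:
            continue  # frame carries no mass on the retained senones
        for c, p in enumerate(sub):
            gamma = p / total  # renormalized occupancy of senone c
            N[c] += gamma
            for d in range(dim):
                F[c][d] += gamma * x[d]
    return N, F


def tandem(mfcc: Sequence[Frame], bottleneck: Sequence[Frame]) -> List[List[float]]:
    """Frame-wise concatenation of MFCC and bottleneck features."""
    if len(mfcc) != len(bottleneck):
        raise ValueError("both feature streams must have the same length")
    return [list(m) + list(b) for m, b in zip(mfcc, bottleneck)]
```

Passing the output of `tandem` as `features` corresponds to the tandem configuration mentioned above. Passing a reduced `keep` list mimics a smaller senone set. Whether posteriors should be renormalized or the DNN retrained for that subset is a design choice this sketch does not settle.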
