Scoring Heterogeneous Speaker Vectors Using Nonlinear Transformations and Tied PLDA Models

Most current state-of-the-art text-independent speaker recognition systems are based on i-vectors and probabilistic linear discriminant analysis (PLDA). PLDA assumes that the i-vectors of a trial are homogeneous, i.e., that they have been extracted by the same system. In other words, the enrollment and test i-vectors belong to the same class. However, it is sometimes important to score trials that include “heterogeneous” i-vectors, for instance, enrollment i-vectors extracted by an old system and test i-vectors extracted by a newer, more accurate system. In this paper, we introduce a PLDA model that is able to score heterogeneous i-vectors independently of their extraction approach, their dimensions, and any other characteristics that make a set of i-vectors of the same speaker belong to different classes. The new model, referred to as nonlinear tied-PLDA (NL-Tied-PLDA), generalizes our recently proposed nonlinear PLDA approach, which jointly estimates the PLDA parameters and the parameters of a nonlinear transformation of the i-vectors. The generalization consists of estimating a class-dependent nonlinear transformation of the i-vectors, with the constraint that the transformed i-vectors of the same speaker share the same speaker factor. The resulting model is flexible and accurate, as shown by a set of experiments performed on the extended core NIST SRE 2012 evaluation. In particular, NL-Tied-PLDA gives better results on heterogeneous trials than the old system obtains on the corresponding homogeneous trials, and, in some configurations, it also reaches the accuracy of the new system. Similar results were obtained on the female extended core NIST SRE 2010 telephone condition.
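
The tying constraint can be illustrated with a minimal sketch. It is not the model estimated in the paper: the class-dependent nonlinear transformation is omitted (the vectors are assumed to be already transformed), the parameters are random rather than trained, and all names (`tied_plda_llr`, `model_old`, `model_new`, the dimensions) are hypothetical. Each class c is assumed to follow a Gaussian model x_c = m_c + U_c y + e_c, where the speaker factor y is shared across classes for the same speaker. The log-likelihood ratio then compares the joint density of the two vectors under a shared y with the product of their marginal densities.

```python
import numpy as np


def gauss_logpdf(x, mean, cov):
    """Log-density of the multivariate Gaussian N(mean, cov) evaluated at x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + maha)


def tied_plda_llr(x_a, x_b, model_a, model_b):
    """Same-speaker vs. different-speaker log-likelihood ratio.

    Each model is a tuple (m, U, S): class mean, speaker subspace and
    residual covariance. The two classes may have different dimensions,
    but U_a and U_b must have the same number of columns, because both
    classes are tied to one speaker factor y ~ N(0, I).
    """
    m_a, U_a, S_a = model_a
    m_b, U_b, S_b = model_b
    C_aa = U_a @ U_a.T + S_a   # marginal covariance of class a
    C_bb = U_b @ U_b.T + S_b   # marginal covariance of class b
    C_ab = U_a @ U_b.T         # cross-covariance induced by the shared y
    joint_cov = np.block([[C_aa, C_ab], [C_ab.T, C_bb]])
    same = gauss_logpdf(np.concatenate([x_a, x_b]),
                        np.concatenate([m_a, m_b]), joint_cov)
    diff = gauss_logpdf(x_a, m_a, C_aa) + gauss_logpdf(x_b, m_b, C_bb)
    return same - diff


def random_model(rng, dim, factor_dim, noise_var=0.1):
    """Hypothetical untrained parameters, for illustration only."""
    m = rng.standard_normal(dim)
    U = rng.standard_normal((dim, factor_dim))
    S = noise_var * np.eye(dim)
    return m, U, S


def sample(rng, model, y):
    """Draw one vector of the given class for speaker factor y."""
    m, U, S = model
    return m + U @ y + rng.multivariate_normal(np.zeros(len(m)), S)


rng = np.random.default_rng(0)
model_old = random_model(rng, dim=6, factor_dim=3)   # "old" extractor class
model_new = random_model(rng, dim=4, factor_dim=3)   # "new" extractor class

# One heterogeneous target trial: both vectors share the speaker factor y.
y = rng.standard_normal(3)
enroll = sample(rng, model_old, y)
test = sample(rng, model_new, y)
print("target-trial LLR:", tied_plda_llr(enroll, test, model_old, model_new))
```

In this sketch the only link between the two classes is the cross-covariance `U_a @ U_b.T`, so the enrollment and test vectors can have different dimensions as long as the speaker factor dimension is common.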
