DNN i-vector based Fishervoice and PLDA SVM scoring for NIST SRE 2016

Our ongoing work, which applies Fishervoice to map joint factor analysis (JFA)-mean supervectors (GMM supervectors obtained from the JFA model) into a compressed discriminant subspace, has shown that cosine distance scoring on the Fishervoice-projected vectors outperforms classical JFA. In this paper, we refine Fishervoice for low-dimensional i-vectors by replacing only the parametric between-class scatter matrix of linear discriminant analysis (LDA) with a nonparametric one. The 2016 speaker recognition evaluation (SRE16) provides only unlabeled in-domain data and labeled out-of-domain data for model training. Support vector machine (SVM) scoring can capture the discriminative information embedded in the unlabeled in-domain training data. Before SVM scoring, we apply probabilistic linear discriminant analysis (PLDA) for inter-session compensation, using the speaker labels of the out-of-domain training data. This approach constitutes CUHK’s submission for SRE16. We present a detailed analysis of the approaches and of the performance gains obtained with the refined Fishervoice and PLDA SVM scoring.
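The abstract does not spell out the exact form of the nonparametric between-class scatter matrix, so the following Python sketch is only an illustration under stated assumptions. It assumes a simplified construction in the style of nonparametric discriminant analysis: each vector is compared with the unweighted mean of its k nearest neighbours in every other speaker class, while the within-class scatter stays parametric as in standard LDA. The function names and the parameters `k` and `reg` are hypothetical and are not taken from the paper or from any toolkit.

```python
import numpy as np


def within_class_scatter(X, y):
    """Parametric within-class scatter, as in standard LDA."""
    dim = X.shape[1]
    Sw = np.zeros((dim, dim))
    for c in np.unique(y):
        Xc = X[y == c]
        D = Xc - Xc.mean(axis=0)
        Sw += D.T @ D
    return Sw


def nonparametric_between_scatter(X, y, k=3):
    """Between-class scatter from local k-NN means instead of class means.

    Assumption: unweighted local means; published variants may add a
    per-sample weighting function, which is omitted here.
    """
    dim = X.shape[1]
    Sb = np.zeros((dim, dim))
    classes = np.unique(y)
    for c in classes:
        Xc = X[y == c]
        for other in classes:
            if other == c:
                continue
            Xo = X[y == other]
            kk = min(k, len(Xo))
            # Squared distances from each class-c vector to each vector of the other class.
            d2 = ((Xc[:, None, :] - Xo[None, :, :]) ** 2).sum(axis=2)
            nn = np.argsort(d2, axis=1)[:, :kk]
            local_means = Xo[nn].mean(axis=1)  # shape (n_c, dim)
            D = Xc - local_means
            Sb += D.T @ D
    return Sb


def fit_projection(X, y, n_components, k=3, reg=1e-6):
    """Return a (dim, n_components) matrix maximising between/within scatter."""
    Sw = within_class_scatter(X, y)
    # Small ridge term (an assumption) keeps Sw invertible.
    Sw = Sw + reg * np.trace(Sw) / Sw.shape[0] * np.eye(Sw.shape[0])
    Sb = nonparametric_between_scatter(X, y, k)
    # Whiten with Sw^{-1/2}, then diagonalise Sb in the whitened space.
    evals, evecs = np.linalg.eigh(Sw)
    whitener = evecs / np.sqrt(evals)
    vals, vecs = np.linalg.eigh(whitener.T @ Sb @ whitener)
    order = np.argsort(vals)[::-1][:n_components]
    return whitener @ vecs[:, order]


def cosine_score(enroll, test, W):
    """Cosine similarity between two vectors after projection by W."""
    a = W.T @ enroll
    b = W.T @ test
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In this sketch, `fit_projection` would be trained on labeled vectors (for example, out-of-domain i-vectors with speaker labels), and `cosine_score` would then compare an enrollment vector with a test vector in the projected subspace. The PLDA and SVM scoring stages described in the abstract are not covered here.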
