Analysis of Posterior Estimation Approaches to I-vector Extraction for Speaker Recognition

The i-vector approach to speaker recognition requires estimating Sufficient Statistics (SS), i.e., zeroth- and first-order statistics, for a given speech utterance with respect to a Universal Background Model (UBM), usually represented by a Gaussian Mixture Model (GMM). Alternative approaches to estimating the SS have also been explored: studies suggest that acoustic phone posteriors estimated by a Deep Neural Network (DNN) based Automatic Speech Recognition (ASR) system can yield more accurate speaker representations with i-vectors. In this paper, we analyze and compare the UBM-GMM and several DNN-based approaches, together with Subspace Gaussian Mixture Models, for estimating a speaker's i-vectors. We show that better frame-level alignments of speech can lead to superior speaker verification performance. This is achieved by using the decoded output of the ASR system, whereas existing systems use the posteriors at the output of the DNN directly. The posteriors obtained from the decoding lattices are suitably rescaled to deal with their sparse nature, which would otherwise affect SS computation. We further show that a direct correlation exists between the senone recognition accuracy of the system generating the posteriors and the performance of the corresponding speaker recognition system. The posterior estimation methods are compared on the standard NIST 2010 SRE dataset. Significant improvements are obtained when using the ASR decoder, confirming that better frame-level alignments improve speaker verification performance. An Equal Error Rate (EER) as low as 0.9% is achieved on the telephone condition of the evaluation set.
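
To make the role of the posteriors concrete, the sketch below shows how zeroth- and first-order Baum-Welch sufficient statistics are typically accumulated from per-frame posteriors before i-vector extraction. It is a minimal numpy illustration under stated assumptions, not the paper's implementation: the function name, array shapes, and the simple posterior flooring used as a stand-in for the lattice-posterior rescaling described in the abstract are all illustrative.

```python
import numpy as np

def collect_sufficient_stats(features, posteriors, ubm_means, floor=1e-4):
    """Zeroth- and first-order Baum-Welch statistics for one utterance.

    features   : (T, D) array of acoustic frames.
    posteriors : (T, C) array of per-frame component/senone posteriors
                 (from a UBM-GMM, a DNN, or decoding lattices).
    ubm_means  : (C, D) array of UBM component means used for centering.
    floor      : small constant added before renormalization so that very
                 sparse posteriors do not zero out entire components
                 (a simple stand-in for the paper's lattice rescaling).
    """
    # Floor and renormalize so each frame's posteriors sum to one.
    gamma = posteriors + floor
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Zeroth-order statistics: soft frame counts per component.
    N = gamma.sum(axis=0)                              # shape (C,)

    # First-order statistics, centered on the UBM means as is standard
    # in i-vector extraction.
    F = gamma.T @ features - N[:, None] * ubm_means    # shape (C, D)
    return N, F
```

Whichever system supplies the posteriors (UBM-GMM alignment, the DNN's senone outputs, or rescaled decoding lattices), only the posterior matrix changes; the resulting statistics N and F then feed the same i-vector extractor.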
