Text-Prompted Speaker Verification Based on Phoneme Clustering with Earth Mover's Distance and Cauchy-Schwarz Divergence

In short-duration text-prompted speaker verification, the enrollment data available for each speaker model is limited, which makes it hard to obtain a robust speaker representation. Under these short-utterance conditions, i-vector/GMM approaches perform even worse than the traditional GMM-MAP modeling method. Content matching within a GMM/HMM framework is one of the state-of-the-art paradigms for short-duration text-dependent speaker verification: models for individual lexical units, such as words, syllables, or phonemes, are built for both the background and the speaker to compensate for content mismatch. However, some phonemes that occur in the test recordings never occur in enrollment, and most phonemes appear with different preceding and succeeding phonemes, which leads to differences in coarticulation. These two problems are called lexical mismatch and context mismatch, respectively. In this work, to overcome the lexical and context mismatch caused by data sparseness, phoneme states are clustered using Earth Mover's Distance and Cauchy-Schwarz divergence as metrics. With the Earth Mover's Distance metric, EER decreased by 6.2% and minDCF08 decreased by 1.9%; with the Cauchy-Schwarz divergence metric, EER decreased by 3.7%, while minDCF08 rose by 1.9%.
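To make the clustering metric concrete, the following is a minimal sketch of the Cauchy-Schwarz divergence between two phoneme states. It assumes each state is summarized by a single diagonal-covariance Gaussian, which the abstract does not state (the actual system may use mixtures per state). The state names, toy parameters, and the `closest_pair` helper are hypothetical and only illustrate the first merge step of an agglomerative clustering.

```python
import math

def log_overlap(mu1, var1, mu2, var2):
    """log of the integral of N(x; mu1, var1) * N(x; mu2, var2) over x,
    for diagonal Gaussians. The integral equals N(mu1; mu2, var1 + var2)."""
    total = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = v1 + v2
        total += math.log(2.0 * math.pi * v) + (m1 - m2) ** 2 / v
    return -0.5 * total

def cs_divergence(a, b):
    """Cauchy-Schwarz divergence: -log( int(pq) / sqrt(int(p^2) * int(q^2)) )."""
    (mu_a, var_a), (mu_b, var_b) = a, b
    cross = log_overlap(mu_a, var_a, mu_b, var_b)
    self_a = log_overlap(mu_a, var_a, mu_a, var_a)
    self_b = log_overlap(mu_b, var_b, mu_b, var_b)
    return 0.5 * (self_a + self_b) - cross

def closest_pair(states):
    """Return (divergence, name1, name2) for the two most similar states,
    i.e. the pair an agglomerative clustering would merge first."""
    names = sorted(states)
    best = None
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            d = cs_divergence(states[p], states[q])
            if best is None or d < best[0]:
                best = (d, p, q)
    return best

# Hypothetical toy states: (mean vector, diagonal variance vector).
states = {
    "aa_s1": ([0.0, 0.0], [1.0, 1.0]),
    "ah_s1": ([0.2, -0.1], [1.0, 1.2]),
    "s_s2": ([3.0, 2.5], [0.5, 0.8]),
}

print(closest_pair(states))
```

The divergence is symmetric and zero for identical distributions, so it can be used directly as a pairwise dissimilarity. Swapping `cs_divergence` for an Earth Mover's Distance between the state distributions would give the other metric named above; that variant is not sketched here.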
