Integrating online i-vector into GMM-UBM for text-dependent speaker verification

GMM-UBM is widely used for text-dependent speaker verification because of its simplicity and effectiveness, while the i-vector provides a compact representation of speaker information. Fusing the two frameworks is therefore promising. In this paper, a variant of the traditional i-vector, extracted at the frame level, is appended to MFCCs to form tandem features. Incorporating these features into a GMM-UBM system achieves 26% and 41% performance gains over the DNN i-vector baseline on the RSR2015 and RedDots evaluation sets, respectively. Moreover, the proposed system, trained on 86 hours of data, performs on par with the DNN i-vector baseline trained on a much larger dataset (5000 hours).
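
The tandem-feature step described in the abstract can be illustrated with a minimal sketch. It assumes the per-frame MFCCs and per-frame online i-vectors have already been computed and are time-aligned; the function name `make_tandem_features`, the toy dimensions, and the values are hypothetical and are not taken from the paper. The sketch only shows the frame-wise concatenation, not the i-vector extraction or the GMM-UBM training.

```python
from typing import List, Sequence


def make_tandem_features(
    mfcc: Sequence[Sequence[float]],
    online_ivectors: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Append a frame-level i-vector to each MFCC frame.

    Hypothetical helper: both inputs are lists of per-frame vectors
    and must contain the same number of frames.
    """
    if len(mfcc) != len(online_ivectors):
        raise ValueError("MFCC and i-vector streams must have the same number of frames")
    # Frame-wise concatenation: [MFCC_t ; ivector_t] for every frame t.
    return [list(m) + list(v) for m, v in zip(mfcc, online_ivectors)]


# Toy example: 2 frames, 2-dim "MFCC" and 3-dim "i-vector" (illustrative sizes only).
mfcc_frames = [[1.0, 2.0], [3.0, 4.0]]
ivector_frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
tandem = make_tandem_features(mfcc_frames, ivector_frames)
print(len(tandem), len(tandem[0]))  # 2 frames, 5 dimensions per frame
```

Under this assumption, the resulting tandem frames would simply replace plain MFCC frames as the input to the GMM-UBM system.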
