HMM-Based Phrase-Independent i-Vector Extractor for Text-Dependent Speaker Verification

The low-dimensional i-vector representation of speech segments is used in state-of-the-art text-independent speaker verification systems. However, i-vectors were deemed unsuitable for the text-dependent task, where simpler and older speaker recognition approaches were found more effective. In this work, we propose a straightforward hidden Markov model (HMM) based extension of the i-vector approach, which allows i-vectors to be successfully applied to text-dependent speaker verification. In our approach, the Universal Background Model (UBM) used to train the phrase-independent i-vector extractor is based on a set of monophone HMMs instead of the standard Gaussian Mixture Model (GMM). To compensate for channel variability, we propose to precondition the i-vectors using a regularized variant of within-class covariance normalization, which can be robustly estimated in a phrase-dependent fashion on the small datasets available for the text-dependent task. The verification scores are cosine similarities between i-vectors, normalized using phrase-dependent s-norm. Experimental results on the RSR2015 and RedDots databases confirm the effectiveness of the proposed approach, especially in rejecting test utterances with a wrong phrase. A simple MFCC-based i-vector/HMM system performs competitively when compared to computationally expensive DNN-based approaches or the conventional relevance MAP GMM-UBM, which does not allow for compact speaker representations. To our knowledge, this paper presents the best published results obtained with a single system on both the RSR2015 and RedDots datasets.
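The preconditioning and scoring pipeline described above (regularized within-class covariance normalization followed by cosine scoring with symmetric score normalization against a cohort) can be sketched as follows. This is a minimal illustration, not the paper's exact recipe: the function names, the regularization toward the identity matrix in the spirit of regularized discriminant analysis, and the weight `alpha` are all assumptions made for the sketch.

```python
import numpy as np

def regularized_wccn(ivectors, labels, alpha=0.1):
    """Estimate a WCCN projection from labeled i-vectors.

    The within-class covariance W is interpolated toward the identity
    with weight `alpha` (an assumed regularization scheme), keeping the
    estimate well-conditioned on small phrase-dependent datasets.
    Returns B such that x -> B.T @ x whitens the within-class scatter.
    """
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    for lab in np.unique(labels):
        cls = ivectors[labels == lab]
        # Accumulate per-class covariance, weighted by class size.
        W += np.cov(cls, rowvar=False, bias=True) * len(cls)
    W /= len(ivectors)
    W_reg = (1.0 - alpha) * W + alpha * np.eye(dim)
    # The Cholesky factor of the inverse covariance is the projection:
    # B.T @ W_reg @ B == I, so within-class variation is whitened.
    return np.linalg.cholesky(np.linalg.inv(W_reg))

def cosine(a, b):
    """Cosine similarity between two (preconditioned) i-vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def s_norm(enroll, test, cohort):
    """Symmetric score normalization against a cohort of i-vectors.

    Averages the z-norm (enrollment side) and t-norm (test side)
    normalized scores; the cohort would be phrase-matched in the
    paper's phrase-dependent setting.
    """
    raw = cosine(enroll, test)
    e_scores = np.array([cosine(enroll, c) for c in cohort])
    t_scores = np.array([cosine(test, c) for c in cohort])
    return 0.5 * ((raw - e_scores.mean()) / e_scores.std()
                  + (raw - t_scores.mean()) / t_scores.std())
```

Using the Cholesky factor of the inverse within-class covariance as the projection is one standard way to realize WCCN; after the projection, plain cosine similarity already discounts the nuisance directions, and s-norm then calibrates the scores per phrase.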
