HMM-Based Phrase-Independent i-Vector Extractor for Text-Dependent Speaker Verification

The low-dimensional i-vector representation of speech segments is used in state-of-the-art text-independent speaker verification systems. However, i-vectors were deemed unsuitable for the text-dependent task, where simpler and older speaker recognition approaches were found more effective. In this work, we propose a straightforward hidden Markov model (HMM) based extension of the i-vector approach, which allows i-vectors to be successfully applied to text-dependent speaker verification. In our approach, the Universal Background Model (UBM) used to train the phrase-independent i-vector extractor is based on a set of monophone HMMs instead of the standard Gaussian Mixture Model (GMM). To compensate for channel variability, we propose to precondition the i-vectors using a regularized variant of within-class covariance normalization, which can be robustly estimated in a phrase-dependent fashion on the small datasets available for the text-dependent task. The verification scores are cosine similarities between i-vectors, normalized using phrase-dependent s-norm. Experimental results on the RSR2015 and RedDots databases confirm the effectiveness of the proposed approach, especially in rejecting test utterances with a wrong phrase. A simple MFCC-based i-vector/HMM system performs competitively when compared to computationally expensive DNN-based approaches or the conventional relevance MAP GMM-UBM, which does not allow for compact speaker representations. To our knowledge, this paper presents the best published results obtained with a single system on both the RSR2015 and RedDots datasets.
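The preconditioning and scoring pipeline described above (regularized within-class covariance normalization followed by cosine scoring with symmetric score normalization against a cohort) can be sketched as follows. This is a minimal illustration, not the paper's exact recipe: the function names, the regularization toward the identity matrix in the spirit of regularized discriminant analysis, and the weight `alpha` are all assumptions made for the sketch.

```python
import numpy as np

def regularized_wccn(ivectors, labels, alpha=0.1):
    """Estimate a WCCN projection from labeled i-vectors.

    The within-class covariance W is interpolated toward the identity
    with weight `alpha` (an assumed regularization scheme), keeping the
    estimate well-conditioned on small phrase-dependent datasets.
    Returns B such that x -> B.T @ x whitens the within-class scatter.
    """
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    for lab in np.unique(labels):
        cls = ivectors[labels == lab]
        # Accumulate per-class covariance, weighted by class size.
        W += np.cov(cls, rowvar=False, bias=True) * len(cls)
    W /= len(ivectors)
    W_reg = (1.0 - alpha) * W + alpha * np.eye(dim)
    # The Cholesky factor of the inverse covariance is the projection:
    # B.T @ W_reg @ B == I, so within-class variation is whitened.
    return np.linalg.cholesky(np.linalg.inv(W_reg))

def cosine(a, b):
    """Cosine similarity between two (preconditioned) i-vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def s_norm(enroll, test, cohort):
    """Symmetric score normalization against a cohort of i-vectors.

    Averages the z-norm (enrollment side) and t-norm (test side)
    normalized scores; the cohort would be phrase-matched in the
    paper's phrase-dependent setting.
    """
    raw = cosine(enroll, test)
    e_scores = np.array([cosine(enroll, c) for c in cohort])
    t_scores = np.array([cosine(test, c) for c in cohort])
    return 0.5 * ((raw - e_scores.mean()) / e_scores.std()
                  + (raw - t_scores.mean()) / t_scores.std())
```

Using the Cholesky factor of the inverse within-class covariance as the projection is one standard way to realize WCCN; after the projection, plain cosine similarity already discounts the nuisance directions, and s-norm then calibrates the scores per phrase.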
