Deep Neural Networks for Extracting Baum-Welch Statistics for Speaker Recognition

We examine the use of Deep Neural Networks (DNNs) for extracting Baum-Welch statistics in i-vector-based text-independent speaker recognition. Instead of training the universal background model with the standard EM algorithm, the components are predefined and correspond to the set of triphone states, whose posterior occupancy probabilities are modeled by a DNN. These frame-level assignments are then combined with standard 60-dimensional MFCC features to compute the Baum-Welch statistics used to train the i-vector extractor and to extract i-vectors. The DNN-based assignments force the i-vectors to capture the idiosyncratic way in which each speaker pronounces each triphone state, which can enrich the purely short-term spectral representation of standard i-vectors. In experiments on Switchboard data with a baseline PLDA classifier, our results show that although the proposed i-vectors alone perform worse than the standard ones, fusing the two yields a 16% relative improvement, indicating that they carry useful complementary information about the speaker's identity. A further experiment with a different DNN configuration attained performance comparable to the baseline i-vectors on NIST SRE 2012 (condition C2, female).
