Paper information - Deep Speaker Feature Learning for Text-Independent Speaker Verification

Recently, deep neural networks (DNNs) have been used to learn speaker features. However, the learned features have not been good enough on their own, so a complex back-end model, either neural or probabilistic, is still needed to handle the residual uncertainty in speaker verification, just as with raw features. This paper presents a convolutional time-delay deep neural network (CT-DNN) structure for speaker feature learning. Experimental results on the Fisher database demonstrate that the CT-DNN produces high-quality speaker features: even with a single feature (0.3 seconds including the context), the equal error rate (EER) can be as low as 7.68%. This effectively confirms that the speaker trait is largely a deterministic short-time property rather than a long-time distributional pattern, and can therefore be extracted from just dozens of frames.
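The abstract reports its result as an EER, the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER can be computed from verification scores; it is not the paper's evaluation code. The function name `compute_eer` and the toy scores are hypothetical, and the sketch assumes that a higher score means the trial is more likely a target (same-speaker) trial.

```python
def compute_eer(target_scores, nontarget_scores):
    """Return (eer, threshold) for two lists of trial scores.

    Assumption: higher score = more likely the same speaker.
    A trial is accepted when its score is >= the threshold.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")

    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))

    best = None  # (gap, eer, threshold)
    for thr in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above it.
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, thr)

    return best[1], best[2]


# Toy example with made-up scores (not data from the paper).
eer, thr = compute_eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
print(f"EER = {eer:.2%} at threshold {thr}")
```

With a finite set of trials the two error rates rarely cross exactly, so the sketch averages them at the closest point; evaluation toolkits may interpolate differently and give slightly different values.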
