Improved deep speaker feature learning for text-dependent speaker recognition

A deep learning approach has been proposed recently to derive speaker identifies (d-vector) by a deep neural network (DNN). This approach has been applied to text-dependent speaker recognition tasks and shows reasonable performance gains when combined with the conventional i-vector approach. Although promising, the existing d-vector implementation still can not compete with the i-vector baseline. This paper presents two improvements for the deep learning approach: a phone-dependent DNN structure to normalize phone variation, and a new scoring approach based on dynamic time warping (DTW). Experiments on a text-dependent speaker recognition task demonstrated that the proposed methods can provide considerable performance improvement over the existing d-vector implementation.

[1]  Patrick Kenny,et al.  Speaker and Session Variability in GMM-Based Speaker Verification , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[2]  Donald J. Berndt,et al.  Using Dynamic Time Warping to Find Patterns in Time Series , 1994, KDD Workshop.

[3]  Thomas Fang Zheng,et al.  Discriminative scoring for speaker recognition based on I-vectors , 2014, Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific.

[4]  Erik McDermott,et al.  Deep neural networks for small footprint text-dependent speaker verification , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[6]  Sergey Ioffe,et al.  Probabilistic Linear Discriminant Analysis , 2006, ECCV.

[7]  Haizhou Li,et al.  An overview of text-independent speaker recognition: From features to supervectors , 2010, Speech Commun..

[8]  Yifan Gong,et al.  Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM , 2012, 2012 IEEE Spoken Language Technology Workshop (SLT).

[9]  Douglas E. Sturim,et al.  Support vector machines using GMM supervectors for speaker verification , 2006, IEEE Signal Processing Letters.

[10]  Themos Stafylakis,et al.  Deep Neural Networks for extracting Baum-Welch statistics for Speaker Recognition , 2014, Odyssey.

[11]  Patrick Kenny,et al.  Joint Factor Analysis Versus Eigenchannels in Speaker Recognition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.