Full-Info Training for Deep Speaker Feature Learning

Recent studies have shown that speaker patterns can be learned from very short speech segments (e.g., 0.3 seconds) by a carefully designed convolutional and time-delay deep neural network (CT-DNN). By training the model to discriminate among the speakers in the training data, frame-level speaker features can be derived from the last hidden layer. Despite its good performance, a potential problem of the present model is that it involves a parametric classifier, i.e., the last affine layer, which may consume some of the discriminative knowledge and thus cause an 'information leak' in the feature learning. This paper presents a full-info training approach that discards the parametric classifier and forces all the discriminative knowledge to be learned by the feature net. Our experiments on the Fisher database demonstrate that this new training scheme produces more coherent features, leading to consistent and notable performance improvements on the speaker verification task.
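The core idea, removing the trainable affine classifier so that all speaker-discriminative knowledge has to reside in the features themselves, can be illustrated with a short sketch. The code below is a minimal illustration and not the authors' implementation: the small FeatureNet stand-in, the layer sizes, the cosine scoring against per-speaker feature centroids, and the scale factor are all assumptions made for this example; the paper's actual CT-DNN and full-info training procedure may differ in detail.

```python
# Minimal sketch (not the authors' code): contrasts a standard parametric
# classifier head with a non-parametric, centroid-based scoring head that
# pushes all discriminative knowledge into the feature net.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureNet(nn.Module):
    """Stand-in for the CT-DNN feature extractor: acoustic frames -> speaker features."""
    def __init__(self, feat_dim=40, emb_dim=400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, emb_dim),
        )

    def forward(self, x):
        # L2-normalize so that scoring reduces to cosine similarity
        return F.normalize(self.net(x), dim=-1)

# Standard training: a parametric classifier (the last affine layer) maps
# features to speaker posteriors; its weights can absorb discriminative
# knowledge that never reaches the features ("information leak").
def parametric_loss(feature_net, classifier, frames, speaker_ids):
    logits = classifier(feature_net(frames))   # classifier: nn.Linear(emb_dim, n_speakers)
    return F.cross_entropy(logits, speaker_ids)

# Full-info-style training (illustrative): the "classifier" is derived from the
# feature net's own outputs as per-speaker centroids, so no trainable classifier
# parameters remain and the feature net must carry all discriminative knowledge.
def full_info_loss(feature_net, frames, speaker_ids, n_speakers, scale=10.0):
    feats = feature_net(frames)                # (batch, emb_dim), unit-norm
    # Per-speaker centroids computed from the current batch
    # (assumption for the sketch: the batch covers all speakers).
    centroids = torch.zeros(n_speakers, feats.size(1), device=feats.device)
    centroids.index_add_(0, speaker_ids, feats)
    counts = torch.bincount(speaker_ids, minlength=n_speakers).clamp(min=1).unsqueeze(1)
    centroids = F.normalize(centroids / counts, dim=-1)
    logits = scale * feats @ centroids.t()     # cosine scores used as logits
    return F.cross_entropy(logits, speaker_ids)

if __name__ == "__main__":
    torch.manual_seed(0)
    n_speakers, feat_dim = 8, 40
    net = FeatureNet(feat_dim=feat_dim)
    frames = torch.randn(64, feat_dim)
    spk = torch.randint(0, n_speakers, (64,))
    print("full-info style loss:", full_info_loss(net, frames, spk, n_speakers).item())
```

The contrast to note is that parametric_loss keeps trainable classifier weights that can hold discriminative information, whereas full_info_loss builds its scoring targets directly from the feature net's outputs, which is the spirit of the full-info training scheme described above.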
