Knowledge Distillation and Random Erasing Data Augmentation for Text-Dependent Speaker Verification

This paper explores Knowledge Distillation (KD) and a data augmentation technique to improve the generalization ability and robustness of text-dependent speaker verification (SV) systems. The KD method involves two neural networks, known as Teacher and Student, where the student is trained to replicate the teacher's predictions and thereby learns their variability during training. To make the distillation process more robust, we apply Random Erasing (RE), a data augmentation technique designed to improve the generalization ability of neural networks. We develop two alternative ways of combining KD and RE, both of which produce a more robust system with better performance, since the student network can learn from teacher predictions on data that does not exist in the original dataset. All alternatives were evaluated on the RSR2015 Part I database, where the proposed variants outperform a reference system based on a single network trained with RE.
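The following is a minimal sketch, assuming a PyTorch setup, of the two ingredients described above: Random Erasing applied to acoustic feature matrices and a teacher-student distillation loss. The network sizes, temperature, erasing probability, and other hyperparameters are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch: Random Erasing on log-Mel features plus a
# teacher-student (KD) loss. Shapes and hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def random_erase(features, p=0.5, max_frac=0.3):
    """Zero out a random time-frequency rectangle with probability p.

    features: (batch, n_mels, n_frames) tensor of acoustic features.
    """
    out = features.clone()
    n_mels, n_frames = out.shape[1], out.shape[2]
    for i in range(out.size(0)):
        if torch.rand(1).item() > p:
            continue
        h = torch.randint(1, max(2, int(max_frac * n_mels)), (1,)).item()
        w = torch.randint(1, max(2, int(max_frac * n_frames)), (1,)).item()
        top = torch.randint(0, n_mels - h + 1, (1,)).item()
        left = torch.randint(0, n_frames - w + 1, (1,)).item()
        out[i, top:top + h, left:left + w] = 0.0  # erased region
    return out


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


# Toy training step: the frozen teacher also sees the erased features, so the
# student learns the teacher's predictions on augmented data absent from the
# original dataset. The linear models stand in for real speaker embeddings.
teacher = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(40 * 100, 10)).eval()
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(40 * 100, 10))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

feats = torch.randn(8, 40, 100)       # batch of 8 utterances, 40 mels, 100 frames
labels = torch.randint(0, 10, (8,))   # speaker/phrase class labels
erased = random_erase(feats)
with torch.no_grad():
    t_logits = teacher(erased)

optimizer.zero_grad()
loss = distillation_loss(student(erased), t_logits, labels)
loss.backward()
optimizer.step()
```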
