论文信息 - Multi channel far field speaker verification using teacher student deep neural networks

Multi channel far field speaker verification using teacher student deep neural networks

Far field input utterance is one of the major causes of performance degradation of speaker verification systems. In this study, we used teacher student learning framework to compensate for the performance degradation caused by far field utterances. Teacher student learning refers to training the student deep neural network in possible performance degradation condition using the teacher deep neural network trained without such condition. In this study, we use the teacher network trained with near distance utterances to train the student network with far distance utterances. However, through experiments, it was found that performance of near distance utterances were deteriorated. To avoid such phenomenon, we proposed techniques that use trained teacher network as initialization of student network and training the student network using both near and far field utterances. Experiments were conducted using deep neural networks that input raw waveforms of 4-channel utterances recorded in both near and far distance. Results show the equal error rate of near and far-field utterances respectively, 2.55 % / 2.8 % without teacher student learning, 9.75 % / 1.8 % for conventional teacher student learning, and 2.5 % / 2.7 % with proposed techniques.

Hye-jin Shim | Hee-Soo Heo | Jee-weon Jung | Ha-Jin Yu

[1] Yoshua Bengio,et al. Batch-normalized joint training for DNN-based distant speech recognition , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[2] Xiong Xiao,et al. Developing Far-Field Speaker System Via Teacher-Student Learning , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3] Michael S. Brandstein,et al. Microphone Arrays - Signal Processing Techniques and Applications , 2001, Microphone Arrays.

[4] Hye-jin Shim,et al. A Complete End-to-End Speaker Verification System Using Deep Neural Networks: From Raw Signals to Verification Result , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5] Wonyong Sung,et al. A statistical model-based voice activity detection , 1999, IEEE Signal Processing Letters.

[6] Hye-jin Shim,et al. Avoiding Speaker Overfitting in End-to-End DNNs Using Raw Waveform for Text-Independent Speaker Verification , 2018, INTERSPEECH.

[7] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[8] Jian Sun,et al. Identity Mappings in Deep Residual Networks , 2016, ECCV.

[9] Yifan Gong,et al. Learning small-size DNN with output-distribution-based criteria , 2014, INTERSPEECH.