Short Utterance Compensation in Speaker Verification via Cosine-Based Teacher-Student Learning of Speaker Embeddings

The short duration of an input utterance is one of the most critical threats that degrade the performance of speaker verification systems. This study aimed to develop an integrated text-independent speaker verification system that inputs utterances with short duration of 2 seconds or less. We propose an approach using a teacher-student learning framework for this goal, applied to short utterance compensation for the first time in our knowledge. The core concept of the proposed system is to conduct the compensation throughout the network that extracts the speaker embedding, mainly in phonetic-level, rather than compensating via a separate system after extracting the speaker embedding. In the proposed architecture, phonetic-level features where each feature represents a segment of 130 ms are extracted using convolutional layers. A layer of gated recurrent units extracts an utterance-level feature using phonetic-level features. The proposed approach also adopts a new objective function for teacher-student learning that considers both Kullback-Leibler divergence of output layers and cosine distance of speaker embeddings layers. Experiments were conducted using deep neural networks that take raw waveforms as input, and output speaker embeddings on VoxCelebl dataset. The proposed model showed 16.6 % relative improvement compared to a baseline approach.

[1]  Xiong Xiao,et al.  Developing Far-Field Speaker System Via Teacher-Student Learning , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Andrew L. Maas Rectifier Nonlinearities Improve Neural Network Acoustic Models , 2013 .

[3]  Xiao Liu,et al.  Deep Speaker: an End-to-End Neural Speaker Embedding System , 2017, ArXiv.

[4]  Jungwon Lee,et al.  Bridgenets: Student-Teacher Transfer Learning Based on Recursive Neural Networks and Its Application to Distant Speech Recognition , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Yifan Gong,et al.  Learning small-size DNN with output-distribution-based criteria , 2014, INTERSPEECH.

[6]  Joon Son Chung,et al.  VoxCeleb: A Large-Scale Speaker Identification Dataset , 2017, INTERSPEECH.

[7]  Sébastien Marcel,et al.  Towards Directly Modeling Raw Speech Signal for Speaker Verification Using CNNS , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Ha-Jin Yu,et al.  Joint Training of Expanded End-to-End DNN for Text-Dependent Speaker Verification , 2017, INTERSPEECH.

[9]  Hisashi Kawai,et al.  Feature Representation of Short Utterances Based on Knowledge Distillation for Spoken Language Identification , 2018, INTERSPEECH.

[10]  M. Ordin,et al.  Acquisition of speech rhythm in a second language by learners with rhythmically different native languages. , 2015, The Journal of the Acoustical Society of America.

[11]  Hye-jin Shim,et al.  RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification , 2019, INTERSPEECH.

[12]  Paavo Alku,et al.  Accounting for uncertainty of i-vectors in speaker recognition using uncertainty propagation and modified imputation , 2015, INTERSPEECH.

[13]  G. E. Peterson,et al.  Duration of Syllable Nuclei in English , 1960 .

[14]  Sridha Sridharan,et al.  i-vector Based Speaker Recognition on Short Utterances , 2011, INTERSPEECH.

[15]  Hitoshi Yamamoto,et al.  Denoising autoencoder-based speaker feature restoration for utterances of short duration , 2015, INTERSPEECH.

[16]  Ha-Jin Yu,et al.  Applying compensation techniques on i-vectors extracted from short-test utterances for speaker verification using deep neural network , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Hye-jin Shim,et al.  Avoiding Speaker Overfitting in End-to-End DNNs Using Raw Waveform for Text-Independent Speaker Verification , 2018, INTERSPEECH.

[18]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[19]  Yoshua Bengio,et al.  Interpretable Convolutional Filters with SincNet , 2018, ArXiv.

[20]  Tara N. Sainath,et al.  Compression of End-to-End Models , 2018, INTERSPEECH.

[21]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[22]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[23]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[24]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Hye-jin Shim,et al.  A Complete End-to-End Speaker Verification System Using Deep Neural Networks: From Raw Signals to Verification Result , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[27]  Koichi Shinoda,et al.  I-vector Transformation Using Conditional Generative Adversarial Networks for Short Utterance Speaker Verification , 2018, INTERSPEECH.