Feature Representation of Short Utterances Based on Knowledge Distillation for Spoken Language Identification

The performance of spoken language identification (LID) on short utterances degrades drastically even when the model is trained entirely on a short-utterance data set. This degradation arises from the large pattern confusion caused by the high variability of feature representations extracted from short utterances. In this paper, we propose a teacher-student network learning algorithm to explore discriminative features for short utterances. With teacher-student learning, the feature representations of short utterances (produced by the student network) are normalized toward the representations of the corresponding long utterances (provided by the teacher network). With this learning algorithm, the feature representation of short utterances is expected to exhibit reduced pattern confusion. Experiments on a 10-language LID task were carried out to evaluate the algorithm, and the results show that the proposed algorithm significantly improves performance.
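To make the teacher-student idea concrete, the following is a minimal PyTorch sketch of feature-level distillation for LID. It is an illustrative assumption, not the paper's exact implementation: the encoder architecture, the average-pooling front end, the MSE distillation term, and the weighting factor alpha are all hypothetical placeholders chosen only to show how a student trained on short excerpts can be pulled toward the teacher's long-utterance embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Toy frame-level encoder that pools a variable-length feature sequence
    into a fixed-dimensional utterance embedding (a stand-in for whatever
    LID front end is actually used)."""

    def __init__(self, feat_dim=40, emb_dim=256, num_langs=10):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, emb_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(emb_dim, num_langs)

    def forward(self, feats):            # feats: (batch, frames, feat_dim)
        h = self.frame_net(feats)        # (batch, frames, emb_dim)
        emb = h.mean(dim=1)              # temporal average pooling
        return emb, self.classifier(emb)


def distillation_step(teacher, student, long_feats, short_feats, labels,
                      optimizer, alpha=0.5):
    """One training step: the teacher encodes the long utterance, the student
    encodes the matching short excerpt, and the student's embedding is pulled
    toward the teacher's (hypothetical MSE regularizer) while also being
    trained for language classification."""
    teacher.eval()
    with torch.no_grad():
        t_emb, _ = teacher(long_feats)   # reference (long-utterance) representation

    s_emb, logits = student(short_feats)
    ce_loss = F.cross_entropy(logits, labels)   # language classification loss
    kd_loss = F.mse_loss(s_emb, t_emb)          # feature-level distillation loss
    loss = ce_loss + alpha * kd_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    teacher, student = Encoder(), Encoder()
    student.load_state_dict(teacher.state_dict())   # warm-start the student
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)

    # Toy batch: 8 long utterances (300 frames) and their 30-frame excerpts.
    long_feats = torch.randn(8, 300, 40)
    short_feats = long_feats[:, :30, :]
    labels = torch.randint(0, 10, (8,))
    print(distillation_step(teacher, student, long_feats, short_feats, labels, opt))
```

In this sketch the teacher would be pretrained on long utterances and then frozen; only the student is updated, so the distillation term acts purely as a regularizer that normalizes short-utterance embeddings toward their long-utterance counterparts.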
