Knowledge Distillation-Based Representation Learning for Short-Utterance Spoken Language Identification

With the successful application of deep feature learning algorithms, spoken language identification (LID) on long utterances achieves satisfactory performance. However, performance on short utterances degrades drastically even when the LID system is trained on short utterances. The main reason is the large variation of representations extracted from short utterances, which results in high model confusion. To narrow the performance gap between long and short utterances, we propose a teacher-student representation learning framework based on knowledge distillation to improve LID performance on short utterances. In the proposed framework, in addition to training the student model on short utterances with their true labels, the internal representation from the output of a hidden layer of the student model is supervised by the teacher model's representation of the corresponding long utterance. By reducing the distance between the internal representations of short and long utterances, the student model learns robust, discriminative representations for short utterances, which is expected to reduce model confusion. Experiments on our in-house LID dataset and the NIST LRE07 dataset demonstrate the effectiveness of the proposed methods on short-utterance LID tasks.
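The training objective described above can be sketched as a combined loss: cross-entropy on the short-utterance labels, plus a distance term pulling the student's hidden-layer representation toward the teacher's representation of the matching long utterance. The following is a minimal NumPy sketch; the mean-squared-error distance and the interpolation weight `alpha` are illustrative assumptions, not necessarily the exact choices made in the paper.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, labels, student_repr, teacher_repr,
                      alpha=0.5):
    """Combined loss for teacher-student representation learning.

    student_logits: (batch, n_langs) student outputs on short utterances
    labels:         (batch,) true language indices for the short utterances
    student_repr:   (batch, dim) student hidden-layer output (short utterance)
    teacher_repr:   (batch, dim) teacher hidden-layer output (long utterance)
    alpha:          interpolation weight (hypothetical; tune on dev data)
    """
    probs = softmax(student_logits)
    # Cross-entropy on the short utterances' true language labels
    ce = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    # Distance between internal representations of short and long utterances
    mse = np.mean((student_repr - teacher_repr) ** 2)
    return alpha * ce + (1.0 - alpha) * mse
```

During training, only the student's parameters would be updated; the teacher's long-utterance representations serve as fixed supervision targets.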
