Language Identification Using Time Delay Neural Network D-Vector on Short Utterances

This paper describes a d-vector language identification (LID) system for short utterances that uses a time delay neural network (TDNN) acoustic model originally trained for the speech recognition task. The TDNN acoustic model was chosen for the ASR system of the ICQ messenger, and here it is applied to the LID task. We compare the TDNN d-vector LID results with an i-vector baseline. The TDNN system performs at a similar level across all utterance durations, whereas the i-vector system performs well only on long utterances. An open-set test is also conducted. A relative improvement of 5.5% over the i-vector system is shown.
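As a rough illustration of the general d-vector idea, the sketch below averages frame-level hidden-layer activations into a single utterance-level vector and scores it against per-language reference vectors with cosine similarity, rejecting low-scoring utterances as out-of-set. This is a minimal sketch under assumptions, not the system described in the paper: the abstract does not specify the layer used, the scoring method, or the open-set decision rule, and the names `d_vector`, `identify`, the `threshold` value, and the toy two-dimensional activations are all hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def d_vector(frame_activations):
    """Average frame-level activations into one unit-length utterance vector.

    frame_activations: list of equal-length lists, one per frame, standing in
    for the outputs of a hidden layer of a neural acoustic model (assumed).
    """
    n_frames = len(frame_activations)
    dim = len(frame_activations[0])
    mean = [sum(frame[i] for frame in frame_activations) / n_frames
            for i in range(dim)]
    return l2_normalize(mean)

def cosine(a, b):
    """Cosine similarity of two unit-length vectors (their dot product)."""
    return sum(x * y for x, y in zip(a, b))

def identify(utt_vec, language_models, threshold=0.5):
    """Return (language, score); 'unknown' if the best score is below threshold.

    The threshold-based rejection is an assumed, simple open-set rule.
    """
    best_lang, best_score = None, float("-inf")
    for lang, model_vec in language_models.items():
        score = cosine(utt_vec, model_vec)
        if score > best_score:
            best_lang, best_score = lang, score
    if best_score < threshold:
        return "unknown", best_score
    return best_lang, best_score

# Toy per-language reference vectors built from made-up activations.
models = {
    "en": d_vector([[1.0, 0.0], [0.9, 0.1]]),
    "ru": d_vector([[0.0, 1.0], [0.1, 0.9]]),
}
in_set = identify(d_vector([[0.8, 0.2]]), models)
out_of_set = identify(d_vector([[-1.0, -1.0]]), models)
```

Because the utterance vector is a mean over frames, it has the same dimensionality regardless of utterance length, which is one reason such embeddings are convenient for short inputs.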
