Bidirectional Modelling for Short Duration Language Identification

Language identification (LID) systems typically employ i-vectors as fixed-length representations of utterances. However, i-vectors may not be reliably estimated from short utterances, which in turn can reduce language identification accuracy. Recently, Long Short-Term Memory networks (LSTMs) have been shown to better model short utterances in the context of language identification. This paper explores the use of bidirectional LSTMs for language identification, with the aim of modelling temporal dependencies between past and future frame-based features in short utterances. Specifically, an end-to-end system for short-duration language identification employing bidirectional LSTM models of utterances is proposed. Evaluations on both the NIST 2007 and 2015 LRE datasets show state-of-the-art performance.
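
As a rough illustration of the kind of model described above, the sketch below shows a minimal end-to-end bidirectional LSTM language identifier over frame-level acoustic features, written in PyTorch. This is not the authors' implementation: the choice of framework, the feature dimension, hidden size, number of layers, number of target languages, and the mean pooling used to reach an utterance-level decision are all illustrative assumptions.

import torch
import torch.nn as nn

class BLSTMLanguageID(nn.Module):
    """Sketch of an end-to-end bidirectional LSTM language identifier."""

    def __init__(self, feat_dim=39, hidden_dim=256, num_layers=2, num_languages=14):
        super().__init__()
        # The bidirectional LSTM reads the frame sequence forwards and backwards,
        # so each frame's representation depends on both past and future context.
        self.blstm = nn.LSTM(
            input_size=feat_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )
        # Utterance-level classifier over the pooled BLSTM outputs
        # (forward and backward states are concatenated, hence 2 * hidden_dim).
        self.classifier = nn.Linear(2 * hidden_dim, num_languages)

    def forward(self, frames):
        # frames: (batch, num_frames, feat_dim), e.g. MFCC or SDC features.
        outputs, _ = self.blstm(frames)      # (batch, num_frames, 2 * hidden_dim)
        pooled = outputs.mean(dim=1)         # temporal average pooling (illustrative)
        return self.classifier(pooled)       # unnormalised per-language scores

if __name__ == "__main__":
    model = BLSTMLanguageID()
    dummy = torch.randn(8, 300, 39)          # 8 utterances of ~3 s at 100 frames/s
    print(model(dummy).shape)                # torch.Size([8, 14])

In practice, the utterance-level aggregation (last frame, attention, or averaging over frames) and the training criterion would follow whatever the paper specifies; the average pooling here is only a placeholder.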
