Spoken Language Identification Using LSTM-Based Angular Proximity

This paper describes the design of an acoustic language identification (LID) system based on LSTMs that directly maps a sequence of acoustic features to a vector in a vector space where angular proximity corresponds to a measure of language/dialect similarity. A specific architecture for the LSTM-based language vector extractor is introduced, along with the angular proximity loss function used to train it. This new LSTM-based LID system is quicker to train than a standard RNN topology using stacked layers trained with the cross-entropy loss function, and it obtains significantly lower language error rates (LER). Experiments compare this approach to our previous developments on the subject, as well as to two widely used LID techniques: a phonotactic system using DNN acoustic models and an i-vector system. Results are reported on two different data sets: the 14 languages of NIST LRE07 and the 20 closely related languages and dialects of NIST LRE15. In addition to the NIST Cavg metric, which served as the primary metric for the LRE07 and LRE15 evaluations, the average LER is reported.
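To make the notion of angular proximity concrete, the following is a minimal sketch, not the system described in the paper. It assumes that each utterance has already been mapped to a fixed-length vector, and that one reference vector per language is available. The function names, the reference vectors, and the nearest-angle decision rule are all hypothetical and chosen for illustration only; the abstract does not specify the loss function or the decision rule.

```python
import math

def angular_distance(u, v):
    """Angle in radians between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)

def nearest_language(embedding, references):
    """Return the label whose reference vector forms the smallest angle
    with the embedding (an assumed decision rule, for illustration)."""
    return min(references,
               key=lambda lang: angular_distance(embedding, references[lang]))

def language_error_rate(predicted, actual):
    """Fraction of utterances whose predicted label differs from the truth."""
    errors = sum(p != a for p, a in zip(predicted, actual))
    return errors / len(actual)

# Hypothetical two-dimensional reference vectors for two languages.
references = {"fr": [1.0, 0.0], "es": [0.0, 1.0]}
predicted = [nearest_language(e, references)
             for e in ([0.9, 0.1], [0.2, 0.8], [0.6, 0.4])]
print(predicted)
print(language_error_rate(predicted, ["fr", "es", "es"]))
```

Because only the angle matters, scaling a vector by any positive constant leaves its distance to every reference vector unchanged, which is the property that distinguishes an angular measure from a Euclidean one.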
