Exploring Transformer-based Language Recognition using Phonotactic Information

This paper describes an encoder-only approach based on the Transformer architecture applied to the language recognition (LRE) task using phonotactic information. Because a single global set of phonemes is used to recognize all languages, the proposed system must overcome the overlap and high co-occurrence of similar phone sequences across languages. To mitigate this issue, we propose a single Transformer-based encoder trained for classification, where the attention mechanism, with its capability of handling long sequences of phonemes, helps to find discriminative sequences of phonotactic units that contribute to correctly identifying the language in short, medium, and long audio segments. The proposed approach provides significant improvements, outperforming phonotactic-based RNN and GloVe-based i-vector architectures with relative improvements of 5.5% and 38.5%, respectively. Our experiments were carried out on phoneme sequences obtained by applying the Allosaurus phoneme recognizer to the KALAKA-3 database. This dataset is challenging because the languages to identify are closely related (mainly Iberian languages, e.g., Spanish, Galician, and Catalan). We report results using the Cavg metric proposed for the NIST evaluations.
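To illustrate how an encoder-only classifier can map a sequence of phoneme IDs to language posteriors, the following is a minimal NumPy sketch of single-head self-attention followed by mean pooling and a linear output layer. All dimensions, parameter names, and the random initialization are illustrative assumptions, not the authors' implementation (which uses a trained, multi-layer Transformer encoder).

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 64     # size of the global phoneme inventory (illustrative)
D = 32         # embedding / model dimension (illustrative)
N_LANGS = 6    # number of target languages (illustrative)

# Randomly initialised parameters; a real system would learn these.
emb = rng.normal(0.0, 0.1, (VOCAB, D))
Wq, Wk, Wv = (rng.normal(0.0, 0.1, (D, D)) for _ in range(3))
Wout = rng.normal(0.0, 0.1, (D, N_LANGS))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def classify(phoneme_ids):
    """Score a phoneme-ID sequence against the language classes."""
    x = emb[phoneme_ids]                  # (T, D) phoneme embeddings
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    att = softmax(q @ k.T / np.sqrt(D))   # (T, T) self-attention weights
    h = att @ v                           # contextualised phoneme states
    pooled = h.mean(axis=0)               # mean pooling over the sequence
    return softmax(pooled @ Wout)         # language posteriors

probs = classify(rng.integers(0, VOCAB, size=50))
print(probs.shape, probs.sum())
```

Because attention relates every phoneme to every other one in the segment, the pooled representation can emphasize phonotactic patterns that are discriminative for a language even when individual phones are shared across languages.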

[1] Li Yang, et al. Big Bird: Transformers for Longer Sequences, 2020, NeurIPS.

[2] Luis Fernando D'Haro, et al. Language Recognition using Neural Phone Embeddings and RNNLMs, 2018, IEEE Latin America Transactions.

[3] Alan W. Black, et al. Universal Phone Recognition with a Multilingual Allophone System, 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4] Mausam, et al. Why and when should you pool? Analyzing Pooling in Recurrent Architectures, 2020, EMNLP.

[5] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[6] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[7] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[8] Alvin F. Martin, et al. The 2011 NIST Language Recognition Evaluation, 2010, INTERSPEECH.

[9] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[10] Yorick Wilks, et al. A Closer Look at Skip-gram Modelling, 2006, LREC.

[11] Rubén San-Segundo-Hernández, et al. On the use of Phone-based Embeddings for Language Recognition, 2018, IberSPEECH.

[12] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[14] Mireia Díez, et al. KALAKA-3: a database for the assessment of spoken language recognition technology on YouTube audios, 2016, Lang. Resour. Evaluation.

[15] Rubén San-Segundo-Hernández, et al. On the use of phone-gram units in recurrent neural networks for language identification, 2016, Odyssey.

[16] Luyao Huang, et al. Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence, 2019, NAACL.
[18] Arman Cohan, et al. Longformer: The Long-Document Transformer, 2020, arXiv.

[19] Marc A. Zissman, et al. Comparison of Four Approaches to Automatic Language Identification of Telephone Speech, 2004.

[20] S. C. Kremer, et al. Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies, 2001.