An unsupervised approach to language identification

This paper presents an unsupervised approach to automatic language identification (ALI) based on vowel system modeling. Each language's vowel system is modeled by a Gaussian mixture model (GMM) trained on automatically detected vowels. Since this detection is unsupervised and language independent, no labeled data are required. GMMs are initialized using an efficient data-driven variant of the LBG algorithm: the LBG-Rissanen (1983) algorithm. On a closed-set identification task with 5 languages from the OGI MLTS corpus, we reach 79% correct identification for male speakers, using only the vowel segments detected in 45-second utterances.
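
As an illustration of how closed-set identification with one GMM per language can work, the following minimal sketch scores the vowel feature frames of an utterance under each language model and picks the best-scoring language. It makes several assumptions that the abstract does not state: a maximum-likelihood decision rule, diagonal covariances, and frame-independent scoring. The function names (gmm_log_likelihood, identify), the language labels, and all numeric values are hypothetical toy choices, not taken from the paper, and vowel detection and LBG-Rissanen initialization are not shown.

```python
import math

def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of feature frames under a diagonal-covariance GMM.

    Assumption: frames are treated as independent, so per-frame
    log-likelihoods are summed.
    """
    total = 0.0
    for x in frames:
        # Log of each weighted Gaussian component density for this frame.
        comp = []
        for w, mu, var in zip(weights, means, variances):
            log_p = math.log(w)
            for xi, mi, vi in zip(x, mu, var):
                log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
            comp.append(log_p)
        # Log-sum-exp over components for numerical stability.
        m = max(comp)
        total += m + math.log(sum(math.exp(c - m) for c in comp))
    return total

def identify(frames, models):
    """Return the language whose GMM gives the highest log-likelihood.

    Assumption: a maximum-likelihood decision over a closed set of languages.
    """
    return max(models, key=lambda lang: gmm_log_likelihood(frames, *models[lang]))

# Toy 2-D models as (weights, means, variances); values are made up.
models = {
    "lang_a": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "lang_b": ([0.5, 0.5], [[5.0, 5.0], [6.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]),
}

# Two hypothetical vowel frames lying close to the "lang_a" components.
frames = [[0.2, -0.1], [0.9, 1.1]]
print(identify(frames, models))
```

In a real system, the frames would be acoustic feature vectors extracted from the detected vowel segments, and each model's parameters would come from training on that language's data rather than being set by hand.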