Fusing language information from diverse data sources for phonotactic language recognition

The baseline approach to building phonotactic language recognition systems is to characterize each language by a single phonotactic model trained on all of the available language-specific data. When several data sources are available for a given target language, system performance can be improved by using source-dependent phonotactic models. In that case, the common practice is to fuse the language source information (i.e., the phonotactic scores for each language/source pair) early, at the input of the backend. This paper proposes to postpone the fusion to the output of the backend, so that the language recognition score can be estimated from well-calibrated language source scores. Experiments were conducted on the NIST LRE 2007 and NIST LRE 2009 evaluation data sets under the 30s condition. On the NIST LRE 2007 eval data, a Cavg of 0.9% is obtained for the closed-set task and 2.5% for the open-set task; compared to the common practice of early fusion, these results represent relative improvements of 18% and 11%, respectively. Initial tests on the NIST LRE 2009 eval data gave no improvement on the closed-set task. Moreover, the Cllr measure indicates that language recognition scores estimated by the proposed approach are better calibrated than those produced by early fusion.
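The late-fusion idea described above can be illustrated with a minimal sketch: given calibrated per-source log-likelihood scores for one target language, the fused language score is obtained by marginalizing over the sources rather than by feeding the raw scores into the backend. The function name, the uniform source prior, and the log-sum-exp marginalization rule are illustrative assumptions for this sketch, not the paper's exact backend.

```python
import numpy as np

def late_fusion(source_scores, source_prior=None):
    """Fuse calibrated per-source log-likelihood scores for one language.

    source_scores : log p(x | language, source) for each data source
                    (assumed already calibrated by the backend).
    source_prior  : P(source | language); a uniform prior is assumed
                    here for illustration.

    Returns log p(x | language) = logsumexp over sources of
    log p(x | language, source) + log P(source | language).
    """
    scores = np.asarray(source_scores, dtype=float)
    if source_prior is None:
        source_prior = np.full(scores.shape, 1.0 / scores.size)
    shifted = scores + np.log(source_prior)
    # numerically stable log-sum-exp marginalization over sources
    m = shifted.max()
    return float(m + np.log(np.exp(shifted - m).sum()))

# If every source agrees, fusion leaves the score unchanged:
# late_fusion([2.0, 2.0]) -> 2.0 (with the uniform prior)
```

In contrast, early fusion would stack the raw per-source scores into a single feature vector presented at the backend input; the sketch above instead combines scores only after per-source calibration.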
