A Multilingual Corpus for Language Identification

In this paper we describe the design, recording, and transcription of a large multilingual corpus of telephone speech in French, English, German, and Spanish, collected for research in automatic language identification. For each language, the corpus contains over 250 calls from native speakers calling from their home country, plus an additional 50 calls from another country. Although the same recording protocol was used for all languages, slight modifications were necessary to account for language- or country-specific differences. We address issues in designing comparable corpora across languages, including how to interact with callers so as to obtain the desired responses.
