Language Resources for a Bilingual Automatic Index System of Broadcast News in Basque and Spanish

Automatic indexing of broadcast news is a developing research area of great recent interest [1]. This paper describes the development steps for designing an automatic index system of broadcast news for both Basque and Spanish. This application requires appropriate Language Resources to design all the components of the system. Nowadays, large and well-defined resources are available for the most widely used languages, but much work remains to be done for minority languages. Although Spanish has many more resources than Basque, this work carries out parallel efforts for both languages. These two languages were chosen because they are co-official in the Basque Autonomous Community and are used in many of the Community's mass media, including the Basque Public Radio and Television EITB [2].

[1] João Paulo da Silva Neto, et al. The COST278 Pan-European Broadcast News Database, 2004, LREC.

[2] Kepa Sarasola, et al. Automatic morphological analysis of Basque, 1996.

[3] Manuel Graña, et al. Selection of Lexical Units for Continuous Speech Recognition of Basque, 2003, CIARP.

[4] Mark Liberman, et al. Transcriber: a free tool for segmenting, labeling and transcribing speech, 1998, LREC.

[5] N. Ezeiza, et al. Morphological segmentation for speech processing in Basque, 2002, Proceedings of 2002 IEEE Workshop on Speech Synthesis.

[6] Karmele López de Ipiña, et al. Using non-word lexical units in automatic speech understanding, 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP99).