Lexica and corpora for speech-to-speech translation: a trilingual approach

This paper describes the creation of lexica and corpora for Catalan, Spanish and US-English. A lexicon for speech recognition and synthesis is being created, including the information relevant to both tasks. It contains 50K common words, selected to achieve wide coverage of the chosen domains, and 50K additional entries comprising application-specific words and proper nouns. Furthermore, a large trilingual spontaneous speech corpus has been created. These corpora, together with other available US-English data, have been translated into their counterpart languages. The translated material is being used to investigate the language resource requirements of statistical machine translation.