Large lexica for speech-to-speech translation: from specification to creation

This paper describes the corpus collection and lexicon creation for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS), the components needed in speech-to-speech translation (SST). The lexica will be specified, built and validated within the EU project LC-STAR (Lexica and Corpora for Speech-to-Speech Translation Components) during the years 2002-2005. Large lexica with phonetic, prosodic and morpho-syntactic content will be provided, together with well-documented specifications, for at least 12 languages [1]. The paper gives a short overview of speech-to-speech translation lexica in general and a summary of the LC-STAR project itself. The specifications for corpus collection and word extraction, as well as the specification and format of the lexica, are presented in more detail in later sections.