Quantifying the MULTEXT-East morphosyntactic resources

The mid-1990s saw the rapid development, to a large extent via EU projects, of multilingual language resources and standards for human language technologies. However, while the development of resources, tools, and standards was well under way for EU languages, there were no comparable efforts for the languages of Central and Eastern Europe (CEE). The MULTEXT-East project (Multilingual Text Tools and Corpora for Eastern and Central European Languages) was a spin-off of the EU MULTEXT project (Ide & Véronis 1994); it ran from 1995 to 1997 and developed standardised language resources for six CEE languages (Dimitrova et al. 1998), as well as for English, the ‘hub’ language of the project. The main results of the project were lexical resources and an annotated multilingual corpus. The most important resource turned out to be the parallel corpus, heavily annotated with structural and linguistic information, which consists of Orwell’s novel 1984 in the English original and its translations, as illustrated in Table 1.