Quantifying the MULTEXT-East morphosyntactic resources

The mid-nineties saw the rapid development, to a large extent via EU projects, of multilingual language resources and standards for human language technologies. However, while the development of resources, tools, and standards was well under way for EU languages, there were no comparable efforts for the languages of Central and Eastern Europe (CEE). The MULTEXT-East project (Multilingual Text Tools and Corpora for Eastern and Central European Languages), a spin-off of the EU MULTEXT project (Ide & Véronis 1994), ran from 1995 to 1997 and developed standardised language resources for six CEE languages (Dimitrova et al. 1998), as well as for English, the ‘hub’ language of the project. The main results of the project were lexical resources and an annotated multilingual corpus. The most important resource turned out to be the parallel corpus, heavily annotated with structural and linguistic information, which consists of Orwell’s novel 1984 in the English original and its translations, as illustrated in Table 1.

Tomaž Erjavec

[1] Nancy Ide, et al. MULTEXT: Multilingual Text Tools and Corpora, 1994, COLING.

[2] Dan Tufis. Tiered Tagging and Combined Language Models Classifiers, 1999, TSD.

[3] Marko Tadić, et al. The MULTEXT-East Morphosyntactic Specification for Slavic Languages, 2003.

[4] Tomaž Erjavec, et al. MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora, 2004, LREC.

[5] Nancy Ide, et al. Sense Discrimination with Parallel Corpora, 2002, SENSEVAL.

[6] C. M. Sperberg-McQueen, et al. Guidelines for electronic text encoding and interchange, 1994.

[7] Nancy Ide, et al. Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages, 1998, COLING-ACL.