MULTEXT-East: morphosyntactic resources for Central and Eastern European languages

The paper presents the MULTEXT-East language resources, a multilingual dataset for language engineering research that focuses on the morphosyntactic level of linguistic description. The dataset comprises morphosyntactic specifications, morphosyntactic lexica, and a parallel corpus: the novel “1984” by George Orwell, which is sentence-aligned and annotated with hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML according to the Text Encoding Initiative Guidelines (TEI P5) and cover 16 languages, mainly from Central and Eastern Europe: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. The dataset is unique in the languages it covers and in the richness of its encoding; it is extensively documented and freely available for research purposes. The paper gives an overview of the MULTEXT-East resources by type and by language, and closes with conclusions and directions for further work.
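To make the corpus annotation concrete, the following minimal sketch shows how word forms, lemmas, and morphosyntactic descriptions might be read from a TEI-encoded sentence. The abstract does not specify the markup, so the element name `w`, the attributes `lemma` and `ana`, the sample sentence, and the tag values are all assumptions chosen for illustration; the actual MULTEXT-East encoding should be checked against its documentation.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample sentence. The <w>/<c> elements, the lemma/ana
# attributes, and the tag values are illustrative assumptions, not
# copied from the MULTEXT-East corpus.
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0" xml:id="s1">
  <w lemma="be" ana="#Vmis3s">was</w>
  <w lemma="clock" ana="#Ncnp">clocks</w>
  <c>.</c>
</s>"""


def read_tokens(xml_text):
    """Return (word form, lemma, morphosyntactic tag) for each <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter(f"{{{TEI_NS}}}w"):
        # Assumed convention: ana holds a '#'-prefixed pointer to the tag.
        msd = w.get("ana", "").lstrip("#")
        tokens.append((w.text, w.get("lemma"), msd))
    return tokens


tokens = read_tokens(SAMPLE)
for form, lemma, msd in tokens:
    print(form, lemma, msd)
```

Punctuation (`c` in this sketch) is skipped on purpose, since only word tokens are assumed to carry lemmas and morphosyntactic tags.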
