Multilingual Resources, Technologies and Evaluation for Central and Eastern European Languages

This paper discusses the building of the first Bulgarian–Polish–Lithuanian (BG–PL–LT for short) experimental corpus. The BG–PL–LT corpus, currently under development for research purposes only, contains more than 3 million words and comprises two corpora: a parallel one and a comparable one. The BG–PL–LT parallel corpus contains more than 1 million words. A small part of it comprises original texts in one of the three languages with translations into the other two, together with official European Union documents available on the Internet. The main part of the parallel corpus consists of fiction originally written in other languages and translated into Bulgarian, Polish, and Lithuanian. The comparable BG–PL–LT corpus includes: (1) texts in Bulgarian, Polish, and Lithuanian, mainly fiction, with text sizes comparable across the three languages, and (2) excerpts from e-media newspapers distributed via the Internet and sharing the same thematic content. Some of the texts have been annotated at paragraph level. This allows texts in all three languages, and in the pairs BG–PL, PL–LT, BG–LT and vice versa, to be aligned at paragraph level in order to produce aligned trilingual and bilingual corpora. The authors focus on the morphosyntactic annotation of the parallel trilingual corpus according to the Corpus Encoding Standard (CES). The tagsets used for corpus annotation are briefly discussed from the point of view of their possible unification in the future. Some examples are presented.
