Lemmatization and Morphosyntactic Tagging of Croatian and Serbian

We investigate state-of-the-art statistical models for lemmatization and morphosyntactic tagging of Croatian and Serbian. The models are built on SETIMES.HR, a new manually annotated corpus of Croatian based on the SETimes parallel corpus. We train models on Croatian text and evaluate them on samples of Croatian and Serbian from the SETimes corpus and the Croatian and Serbian Wikipedias. Lemmatization accuracy for the two languages reaches 97.87% and 96.30%, while full morphosyntactic tagging accuracy using a 600-tag tagset peaks at 87.72% and 85.56%, respectively. Part-of-speech tagging accuracies reach 97.13% and 96.46%. The results indicate that more complex methods of Croatian-to-Serbian annotation projection are not required at these dataset sizes for these particular tasks. The SETIMES.HR corpus, the resulting models, and the test sets are all made freely available.
