Machine Learning and Deep Neural Network-Based Lemmatization and Morphosyntactic Tagging for Serbian

The training of new tagger models for Serbian is primarily motivated by the enhancement of the existing tagset with the grammatical category of gender. Harmonizing resources that were manually annotated within different projects over a long period of time was an important task, enabled by the development of tools that support partial automation. The supporting tools take into account different taggers and tagsets. This paper focuses on the TreeTagger and spaCy taggers, and on the annotation schema alignment between Serbian morphological dictionaries, MULTEXT-East, and the Universal Part-of-Speech tagset. The trained models will be used to publish the new version of the Corpus of Contemporary Serbian as well as the Serbian literary corpus. The performance of the developed taggers was compared and the impact of training set size was investigated; both new models reached around 98% PoS-tagging precision per token. The sr_basic annotated dataset will also be published.