E-magyar - A Digital Language Processing System

e-magyar is a new toolset for the analysis of Hungarian texts. It was produced as a collaborative effort of the Hungarian language technology community, integrating the best state-of-the-art tools, enhancing them where necessary, making them interoperable, and releasing them under a clear license. It is a free, open, modular text processing pipeline that is integrated into the GATE system, which offers further prospects for interoperability. From tokenizing to parsing and named entity recognition, existing tools were examined, and those selected for integration underwent varying amounts of overhaul so that they operate in the pipeline with a uniform encoding and run on the same Java platform. The tokenizer was rebuilt from the ground up, and the flagship module, the morphological analyzer, based on the Humor system (Prószéky and Kis, 1999), was given a new annotation system and implemented in the HFST framework (Lindén et al., 2009). The system is aimed at a broad range of users, from language technology application developers to digital humanities researchers. It comes with a drag-and-drop demo on its website: http://e-magyar.hu/en/.

[1] Joel Nothman, et al. Transforming Wikipedia into Named Entity Training Data, 2008, ALTA.

[2] Erik F. Tjong Kim Sang, et al. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, 2003, CoNLL.

[3] Veronika Vincze, et al. Universal Dependencies and Morphology for Hungarian - and on the Price of Universality, 2017, EACL.

[4] Milyen a jó Humor?, 2006.

[5] Eszter Simon, et al. Approaches to Hungarian Named Entity Recognition, 2013.

[6] Attila Novák. A New Form of Humor: Mapping Constraint-Based Computational Morphologies to a Finite-State Representation, 2014, LREC.

[7] Balázs Kis, et al. A Unification-based Approach to Morpho-syntactic Parsing of Agglutinative and Other (Highly) Inflectional Languages, 1999, ACL.

[8] Péter Rebrus, et al. Morphdb.hu: Hungarian lexical database and morphological grammar, 2006, LREC.

[9] Attila Novák, et al. Model of computational morphology and its application to Uralic languages, 2015.

[10] Dan Klein, et al. Learning Accurate, Compact, and Interpretable Tree Annotation, 2006, ACL.

[11] Bernd Bohnet, et al. Top Accuracy and Fast Dependency Parsing is not a Contradiction, 2010, COLING.

[12] Veronika Vincze, et al. magyarlanc: A Tool for Morphological and Dependency Parsing of Hungarian, 2013, RANLP.

[13] Attila Novák, et al. PurePos 2.0: a hybrid tool for morphological disambiguation, 2013, RANLP.

[14] András Kornai, et al. HunPos: an open source trigram tagger, 2007, ACL.

[15] János Csirik, et al. A highly accurate Named Entity corpus for Hungarian, 2006, LREC.

[16] András Kornai, et al. Creating Open Language Resources for Hungarian, 2004, LREC.

[17] Thorsten Brants, et al. TnT - A Statistical Part-of-Speech Tagger, 2000, ANLP.

[18] Tommi A. Pirinen, et al. HFST Tools for Morphology - An Efficient Open-Source Package for Construction of Morphological Analyzers, 2009, SFCM.

[19] Richárd Farkas, et al. Special Techniques for Constituent Parsing of Morphologically Rich Languages, 2014, EACL.

[20] János Csirik, et al. The Szeged Treebank, 2005, TSD.

[21] János Csirik, et al. Hungarian Dependency Treebank, 2010, LREC.

[22] Balázs Indig, et al. HunTag3, a general-purpose, modular sequential tagger - chunking phrases in English and maximal NPs and NER for Hungarian, 2015.