Evaluating Off-the-Shelf NLP Tools for German

It is not always easy to keep track of which tools are currently available for a particular annotation task, nor is it obvious how the provided models will perform on a given data set. In this contribution, we provide an overview of the tools available for the automatic annotation of German-language text. We evaluate fifteen free and open source NLP tools for the linguistic annotation of German on the fundamental NLP tasks of sentence segmentation, tokenization, POS tagging, morphological analysis, lemmatization, and dependency parsing. To get an idea of how well the systems’ performance generalizes across domains, we compiled our test corpus from several non-standard domains. All systems in our study are evaluated not only with respect to accuracy, but also with respect to the computational resources they require.
