An automatic part-of-speech tagger for Middle Low German

Syntactically annotated corpora are highly important for large-scale diachronic and diatopic language research. Such corpora have recently been developed, or are still being developed, for a variety of historical languages. One of those under development is the fully tagged and parsed Corpus of Historical Low German (CHLG), which aims to facilitate research into the under-researched diachronic syntax of Low German. This paper reports on a crucial step in creating the corpus, namely the development of a part-of-speech (POS) tagger for Middle Low German (MLG). Because MLG was transmitted in several non-standardised written varieties, it poses a challenge to standard POS taggers, which usually rely on normalised spelling. We outline the major issues faced in creating the tagger and present our solutions to them.
