POS-tagging of Historical Dutch

We study how adequate current methods are for POS-tagging historical Dutch texts, and explore how different techniques can improve on current practice. The focus is on (unsupervised) methods that are easily adapted to different domains without extensive manual input. Modernising the spelling of a corpus before tagging it with a tagger trained on contemporary Dutch yields a large increase in accuracy, but spelling normalisation alone is not sufficient to obtain state-of-the-art results. The best results were achieved by training a POS-tagger on a corpus that was annotated automatically by projecting (automatically assigned) POS-tags via word alignments from a contemporary corpus. This result is promising, as it was reached without including any domain knowledge or context dependencies. We argue that the insights of this study, combined with semi-supervised learning techniques for domain adaptation, can be used to develop a general-purpose diachronic tagger for Dutch.
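
To make the projection step concrete, the following is a minimal sketch of how POS-tags might be carried from a contemporary sentence to its historical counterpart through word alignments. It is an illustration under stated assumptions, not the implementation used in the study: the function name `project_tags`, the alignment format (pairs of modern and historical token indices), the tag labels, the `UNK` placeholder for unaligned tokens, and the example sentence are all hypothetical.

```python
from collections import Counter


def project_tags(modern_tags, alignment, n_historical, unaligned="UNK"):
    """Project tags from modern tokens onto historical tokens.

    modern_tags:   one tag per token of the contemporary sentence
    alignment:     (modern_index, historical_index) pairs (assumed format)
    n_historical:  number of tokens in the historical sentence
    unaligned:     placeholder for historical tokens with no aligned word
    """
    # Collect one vote per alignment link pointing at each historical token.
    votes = [Counter() for _ in range(n_historical)]
    for modern_idx, hist_idx in alignment:
        votes[hist_idx][modern_tags[modern_idx]] += 1

    # Each historical token takes the most frequent tag among its links.
    return [v.most_common(1)[0][0] if v else unaligned for v in votes]


# Hypothetical sentence pair; tags and alignment are invented for illustration.
modern_tags = ["PRON", "VERB", "DET", "NOUN", "VERB"]
historical = ["ick", "hebbe", "dat", "boeck", "ghelesen", "voorwaer"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]  # last token unaligned

projected = project_tags(modern_tags, alignment, len(historical))
for token, tag in zip(historical, projected):
    print(token, tag)
```

A corpus annotated this way could then serve as training data for a tagger. How unaligned or multiply aligned tokens are handled is a design choice; the majority vote and placeholder shown here are only one simple option.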
