Combining Phonology and Morphology for the Normalization of Historical Texts

This paper presents a proposal for the normalization of word-forms in historical texts. To perform this task, we extend our previous research on the induction of phonology and adapt it to normalization. In particular, we combine our earlier models with models for learning morphology without additional supervision. The results are mixed: inducing the segmentation of morphemes does not directly yield significant improvements, whereas including known morpheme boundaries in standard texts does improve results.
