Evaluating the Noisy Channel Model for the Normalization of Historical Texts: Basque, Spanish and Slovene

This paper presents a method for the normalization of historical texts using a combination of weighted finite-state transducers and language models. We have extended our previous work on the normalization of dialectal texts and tested the method against a 17th-century literary work in Basque. The preprocessed corpus is made available in the LREC repository. The performance of the method for learning relations between historical and contemporary word forms is evaluated against resources in three languages. The method learns to map phonological changes using a noisy channel model. The model builds on techniques commonly used for phonological inference to produce grapheme-to-grapheme conversion systems encoded as weighted transducers, and it achieves F-scores above 80% on the Basque task. A wider evaluation shows that the approach performs equally well across all the languages in our evaluation suite: Basque, Spanish and Slovene. A comparison against other methods that address the same task is also provided.
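
The following is a minimal Python sketch of the noisy channel idea described in the abstract: a channel model assigns probabilities to historical spellings given modern forms, and a language model over modern forms supplies the prior, with normalization picking the modern form that maximizes their product. The rewrite rules, lexicon entries and probabilities below are invented toy values, and the paper's actual system encodes both components as weighted finite-state transducers learned from a parallel corpus, so this is only an illustration of the scoring scheme, not the authors' implementation.

```python
import math

# Channel model: P(historical character | modern character) for a few toy
# rewrites (modern -> historical). Characters without a rule map to themselves.
REWRITES = {
    "b": [("v", 0.4), ("b", 0.6)],
    "z": [("ç", 0.3), ("z", 0.7)],
}

# Language model: toy unigram probabilities over modern word forms.
LEXICON = {"zabal": 0.004, "beltz": 0.002}


def channel_logprob(modern: str, historical: str) -> float:
    """Score log P(historical | modern) with a 1:1 character alignment.

    The real system learns many-to-many alignments; this sketch only
    handles equal-length pairs for simplicity.
    """
    if len(modern) != len(historical):
        return float("-inf")
    logp = 0.0
    for m, h in zip(modern, historical):
        options = dict(REWRITES.get(m, [(m, 1.0)]))
        if h not in options:
            return float("-inf")
        logp += math.log(options[h])
    return logp


def normalize(historical: str):
    """Return the modern form maximizing P(modern) * P(historical | modern)."""
    best, best_score = None, float("-inf")
    for modern, prior in LEXICON.items():
        score = math.log(prior) + channel_logprob(modern, historical)
        if score > best_score:
            best, best_score = modern, score
    return best


if __name__ == "__main__":
    print(normalize("çaval"))  # -> "zabal" under the toy model above
```

In the paper's setting the same decision is made by composing a weighted transducer encoding the learned character correspondences with an acceptor for the standard lexicon and language model, then taking the lowest-cost path.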
