From Old Texts to Modern Spellings: An Experiment in Automatic Normalisation

We tackle the problem of spelling variation in a corpus of personal Portuguese letters from the 16th to the 20th century. We investigated the extent to which the task of normalising Portuguese spelling can be accomplished automatically. We adapted VARD2 (Baron and Rayson, 2008), a statistical tool for normalising spelling, for use with the Portuguese language and studied its performance over four different time periods. Our results showed that VARD2 performed best on the older letters and worst on the most modern ones. In an extrinsic evaluation, we measured the usefulness of automatic normalisation for the linguistic task of automatic POS-tagging and showed that automatic normalisation of spelling helps improve the performance of the POS-tagger.

[1] Klaus U. Schulz, et al. Unsupervised Learning of Edit Distance Weights for Retrieving Historical Spelling Variations, 2007.

[2] Walter Daelemans, et al. MBT: Memory Based Tagger, version 1.0, Reference Guide, 2002.

[3] Dawn Archer, et al. VARD versus WORD: A comparison of the UCREL variant detector and modern spellcheckers on English historical corpora, 2005.

[4] Rafael Giusti, et al. Automatic detection of spelling variation in historical corpus: An application to build a Brazilian Portuguese spelling variants dictionary, 2007.

[5] Norbert Fuhr, et al. Retrieval in text collections with historic spelling using linguistic and spelling variants, 2007, JCDL '07.

[6] Dawn Archer, et al. Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora, 2007.

[7] Charlotte Galves, et al. Clitic Placement and the Position of Subjects in the History of European Portuguese, 2005.

[8] Alexander Dekhtyar, et al. Information Retrieval, 2018, Lecture Notes in Computer Science.

[9] I. J. Ginneken, et al. Romance Languages and Linguistic Theory 2003, 2005.

[10] Walter Daelemans, et al. Weigh your words - memory-based lemmatization for Middle Dutch, 2010, Lit. Linguistic Comput.

[11] M. de Rijke, et al. A Cross-Language Approach to Historic Document Retrieval, 2006, ECIR.

[12] Paul Rayson, et al. VARD2: a tool for dealing with spelling variation in historical corpora, 2008.

[13] Vitor Rocio, et al. Automated Creation of a Medieval Portuguese Partial Treebank, 2003.

[14] Klaus U. Schulz, et al. Towards information retrieval on historical document collections: the role of matching procedures and special lexica, 2010, International Journal on Document Analysis and Recognition (IJDAR).