Automatic detection of spelling variation in historical corpus: An application to build a Brazilian Portuguese spelling variants dictionary

The Historical Dictionary of Brazilian Portuguese (HDBP), the first of its kind, is based on a corpus of Brazilian Portuguese (BP) texts from the sixteenth through the eighteenth centuries (plus some texts from the beginning of the nineteenth century) and is being developed under the sponsorship of the Brazilian funding agency CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico). It is a three-year project, started in 2006, that aims to fill a gap in Brazilian culture with a dictionary describing the vocabulary of Brazilian Portuguese from the beginning of the country's history. The corpus totals more than 3,000 texts with approximately 7.5 million words. Our working corpus, i.e. the part already processed by the corpus processing system UNITEX (http://www-igm.univ-mlv.fr/~unitex/), is encoded in Unicode (UTF-16) and totals 1,733 texts, 57.1 MB, and 4.9 million words. One difficulty in using historical corpora for lexicographic tasks is identifying all spelling variants of a given word, since spelling variation distorts frequency counts, a usual criterion for selecting dictionary entries. In our project, a further challenge is to select all variants of a dictionary entry that occur in the corpus, both to illustrate the absence of an orthographic system in those centuries and to provide example sentences for them. This paper introduces an approach based on transformation rules that clusters distinct spelling variants around a common form, which is not always the orthographic (or modern) form, and describes the choices made to build a dictionary of BP spelling variants from these clusters. Currently, we have forty-three manually developed rules, which generated 12,189 clusters of spelling variants, totalling 27,199 variants from our working corpus. After a careful analysis of these clusters, we adopted the DELA format to build our dictionary.
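To make the idea of rule-based clustering concrete, the following is a minimal sketch in Python, not the project's implementation. The four rules and the sample words are assumptions chosen only for illustration; they are not the forty-three rules mentioned above, and the function names `common_form` and `cluster_variants` are hypothetical. Each rule rewrites a spelling pattern, and words that end up with the same rewritten form are grouped into one cluster.

```python
import re
from collections import defaultdict

# Hypothetical transformation rules, for illustration only.
# They are NOT the forty-three rules developed in the project.
RULES = [
    (re.compile(r"ph"), "f"),            # pharmacia -> farmacia
    (re.compile(r"th"), "t"),            # theatro   -> teatro
    (re.compile(r"y"), "i"),             # y written for i
    (re.compile(r"([lmnpt])\1"), r"\1"), # doubled consonant -> single
]


def common_form(word):
    """Apply every rule in order and return the resulting common form."""
    form = word.lower()
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form


def cluster_variants(words):
    """Group words by common form; keep only clusters with 2+ spellings."""
    clusters = defaultdict(set)
    for word in words:
        clusters[common_form(word)].add(word.lower())
    return {
        form: sorted(variants)
        for form, variants in clusters.items()
        if len(variants) > 1
    }


if __name__ == "__main__":
    sample = ["pharmacia", "farmacia", "ella", "ela", "theatro", "teatro", "casa"]
    for form, variants in cluster_variants(sample).items():
        print(form, variants)
```

In this sketch the cluster key is simply whatever form the rules produce, which echoes the point that the common form need not be the modern orthographic form. Words with a single attested spelling, such as `casa` in the sample, form no cluster.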
The BP dictionary of spelling variants enables sophisticated searches in the historical corpus using UNITEX, supporting the construction of the main dictionary of the HDBP project. Moreover, the variants of a given word can be looked up with Dicionario, an application we developed to display dictionaries in DELA format. Since we also use Philologic (http://philologic.uchicago.edu/index.php) to support the building of the HDBP, we carried out a comparative evaluation between our approach to clustering distinct spelling variants and AGREP (http://www.tgries.de/agrep/), which Philologic uses to check for similar or alternative spellings.

1 University of São Paulo, NILC, CP 668, 13560-970, São Carlos/SP, Brazil. E-mail: rg@grad.icmc.usp.br, arnaldoc@icmc.usp.br, marcelo.muniz@gmail.com, liviacucatto@yahoo.com.br, sandra@icmc.usp.br
