Spelling Error Patterns in Brazilian Portuguese

Fifty years after Damerau compiled his statistics on the distribution of errors in typed texts, his findings are still applied across a range of languages. Because these statistics were derived from English texts, the question of whether they actually hold for other languages has been raised. We address this issue by analyzing a set of typed texts in Brazilian Portuguese and deriving statistics tailored to this language. Results show that diacritical marks play a major role, as indicated by the frequency of mistakes involving them. This renders Damerau's original findings mostly unfit for spelling correction systems, although they remain useful if such marks are set aside. Furthermore, a comparison between these results and those published for Spanish shows no statistically significant differences between the two languages, an indication that the distribution of spelling errors depends on the adopted character set rather than on the language itself.
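To make the kind of tally discussed above concrete, the following is a minimal sketch of how a (typed, intended) word pair might be assigned to one of Damerau's four single-error classes (insertion, deletion, substitution, transposition) or to a separate diacritic class. The function names, the category labels, the rule of checking diacritics first, and the sample pairs are all assumptions made for illustration; they are not the procedure or data used in the study.

```python
import unicodedata
from collections import Counter


def strip_marks(text):
    """Remove combining marks, e.g. 'ã' -> 'a' and 'ç' -> 'c'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def classify_error(typed, intended):
    """Label a word pair with a hypothetical error category."""
    # Work on precomposed characters so that 'é' counts as one symbol.
    typed = unicodedata.normalize("NFC", typed)
    intended = unicodedata.normalize("NFC", intended)
    if typed == intended:
        return "correct"
    # Assumption: any pair that differs only in diacritics is a diacritic error.
    if strip_marks(typed) == strip_marks(intended):
        return "diacritic"
    if len(typed) == len(intended):
        diffs = [i for i in range(len(typed)) if typed[i] != intended[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and typed[diffs[0]] == intended[diffs[1]]
                and typed[diffs[1]] == intended[diffs[0]]):
            return "transposition"
    if len(typed) == len(intended) + 1:
        # One extra character was typed.
        if any(typed[:i] + typed[i + 1:] == intended for i in range(len(typed))):
            return "insertion"
    if len(typed) + 1 == len(intended):
        # One character was left out.
        if any(intended[:i] + intended[i + 1:] == typed for i in range(len(intended))):
            return "deletion"
    return "multiple"


# Illustrative pairs only, not drawn from the corpus analyzed in the paper.
pairs = [("voce", "você"), ("nao", "não"), ("caas", "casa"),
         ("cassa", "casa"), ("csa", "casa"), ("cada", "casa")]
tally = Counter(classify_error(typed, intended) for typed, intended in pairs)
```

Counting the labels in `tally` over a real corpus would give a distribution comparable to Damerau's; dropping the diacritic check corresponds to setting such marks aside.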
