Context-Based Spelling Correction for the Dutch Language: Applied on spelling errors extracted from the Dutch Wikipedia revision history

This thesis investigates context-based spellchecking approaches for the Dutch language. Context-based approaches make it possible to detect real-word spelling errors, that is, misspellings that happen to form another valid word, by using the context in which the errors occur. We also assessed whether context can improve the ranking of replacement candidates. To measure the performance of the different techniques, we obtained a dataset of erroneous-corrected sentence pairs from the Dutch Wikipedia revision history. This dataset contains a wide variety of human-generated spelling errors and consists of over 1.4 million instances; it can serve as a basis for further research. The dataset proved to be a valuable source for building an error model, with which we could improve the ranking of candidate replacement words. This model takes into account the character context in which erroneous edit operations occur, and therefore reflects which edit operations are more likely to occur. The spellchecking results on our dataset show that the context-based approach works for both the detection of errors and the ranking of candidate replacements. We compared our results with the literature to assess whether the technique performs as well for Dutch as for English, and we conclude that the performance is comparable. For the task of candidate ranking, the error model trained on our dataset worked better than the context-based approach.
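To make the idea of a character-context error model concrete, the following is a minimal Python sketch and not the model used in the thesis. It makes several simplifying assumptions: only single-character substitutions are considered, the context is just the one character preceding the edit, probabilities are add-one smoothed over an assumed 26-letter alphabet, and the training pairs are invented toy word pairs rather than data from the Wikipedia-derived dataset. All function names are hypothetical.

```python
from collections import Counter


def single_substitution(wrong, right):
    """Return (previous char, intended char, typed char) if the two words
    differ by exactly one substituted character, otherwise None."""
    if len(wrong) != len(right):
        return None
    diffs = [i for i, (a, b) in enumerate(zip(wrong, right)) if a != b]
    if len(diffs) != 1:
        return None
    i = diffs[0]
    prev = right[i - 1] if i > 0 else "^"  # "^" marks the start of the word
    return prev, right[i], wrong[i]


def train(pairs):
    """Count observed substitutions together with their left character context."""
    edits, contexts = Counter(), Counter()
    for wrong, right in pairs:
        edit = single_substitution(wrong, right)
        if edit is not None:
            edits[edit] += 1
            contexts[edit[:2]] += 1  # (previous char, intended char)
    return edits, contexts


def edit_probability(edits, contexts, wrong, candidate, alphabet_size=26):
    """Add-one smoothed estimate of P(typed char | previous char, intended char),
    conditioned on a substitution error having occurred."""
    edit = single_substitution(wrong, candidate)
    if edit is None:
        return 0.0  # outside the scope of this toy model
    return (edits[edit] + 1) / (contexts[edit[:2]] + alphabet_size)


def rank(edits, contexts, wrong, candidates):
    """Order candidate corrections from most to least likely under the model."""
    return sorted(
        candidates,
        key=lambda c: edit_probability(edits, contexts, wrong, c),
        reverse=True,
    )


# Invented toy training pairs (typed, intended): final "d" typed as "t" after "n".
toy_pairs = [("hont", "hond"), ("lant", "land"), ("want", "wand")]
edits, contexts = train(toy_pairs)
ranking = rank(edits, contexts, "mont", ["munt", "mond"])
```

With these toy pairs, `ranking` places "mond" ahead of "munt", because the substitution of "t" for "d" after "n" was seen in training while the substitution of "o" for "u" after "m" was not. A fuller model would also cover insertions, deletions, and transpositions, and would combine this score with a word or context probability.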
