Detecting and correcting spelling errors in high-quality Dutch Wikipedia text

For the CLIN28 shared task, we evaluated systems for spelling correction of high-quality text. The task focused on detecting and correcting spelling errors in Dutch Wikipedia pages. Three teams took part. We compared the performance of their systems, measured as F1 score, to that of a baseline system, the Dutch spelling corrector Valkuil. Although two of the three participating systems performed well at correcting spelling errors, error detection proved challenging and, without exception, resulted in a high false positive rate. As a result, no system improved on the F1 score of the baseline. This paper elaborates on each team's approach to the task and discusses the overall challenges of correcting high-quality text.
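The link between false positives and F1 can be illustrated with a minimal sketch. It assumes the standard definition of F1 as the harmonic mean of precision and recall; the function name and all counts below are hypothetical and are not taken from the shared task results.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    if true_pos == 0:
        # No correct detections: precision and recall are both zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: both systems find the same 80 real errors and miss 20,
# but the second one also flags many correct words as errors.
few_false_alarms = f1_score(true_pos=80, false_pos=20, false_neg=20)
many_false_alarms = f1_score(true_pos=80, false_pos=320, false_neg=20)

print(f"few false positives:  F1 = {few_false_alarms:.2f}")
print(f"many false positives: F1 = {many_false_alarms:.2f}")
```

With recall held fixed, the extra false positives reduce precision and therefore F1. In high-quality text, where real errors are rare, even a small rate of false alarms per word can outnumber the true detections.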
