Deep Learning-Based Context-Sensitive Spelling Typing Error Correction

This study addresses the context-sensitive spelling error problem for English documents. There are two types of spelling errors in English: non-word spelling errors and context-sensitive spelling errors. Non-word spelling errors are simple to handle because they can be detected merely by matching the words in a sentence against a dictionary. Context-sensitive spelling errors are harder to correct because the erroneous word is itself a valid dictionary word, so the relationship between the word to be corrected and its surrounding context must be taken into account. Spelling errors are considered noise in every field that uses text information, and preprocessing via document correction is necessary to minimize this noise. Context-sensitive spelling errors include homophone errors (which arise from the incorrect use of words that sound the same but are spelled differently), typographical errors (caused by striking an incorrect key on a keyboard), grammatical errors (which occur when the user does not know the correct grammatical rules), and cross-word-boundary errors (which arise from incorrect spacing between words). This study focuses on typographical errors. The context-sensitive spelling error problem is solved using deep learning rather than the existing statistical methods. The deep learning language model-based correction approach is divided into four parts, namely, correction based on word embedding information, contextual embedding information, an auto-regressive (AR) language model, and an auto-encoding (AE) language model. Of these, the AE language model-based approach achieved the best correction performance, which we verified through a detailed correction test.
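The distinction between the two error types can be illustrated with a minimal sketch. It is not the method of this study: the toy vocabulary `VOCAB` and the helper names `non_word_errors`, `edits1`, and `confusable_words` are assumptions introduced only for illustration. The sketch shows that a dictionary lookup flags a non-word typo but lets a real-word typo through, which is why a context-aware model is then needed to choose among the confusable candidates.

```python
# Toy vocabulary (assumption); a real system would load a full dictionary.
VOCAB = {"i", "got", "a", "letter", "from", "form", "my", "friend"}


def non_word_errors(sentence, vocab=VOCAB):
    """Return the tokens that do not appear in the vocabulary."""
    return [tok for tok in sentence.lower().split() if tok not in vocab]


def edits1(word):
    """All strings one deletion, transposition, substitution or insertion away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def confusable_words(word, vocab=VOCAB):
    """Valid words reachable from `word` by a single typing slip."""
    return sorted((edits1(word) & vocab) - {word})


# Non-word typo: "frm" is not in the dictionary, so lookup catches it.
print(non_word_errors("I got a letter frm my friend"))   # ['frm']

# Context-sensitive typo: "form" is a valid word, so lookup finds nothing.
print(non_word_errors("I got a letter form my friend"))  # []

# Candidates that a context-aware model would have to rank.
print(confusable_words("form"))                          # ['from']
```

In this sketch the candidate set for "form" contains "from", but nothing in the dictionary check can tell which of the two fits the sentence. Ranking such candidates by how well they fit the surrounding words is the step that the language model-based approaches described above are meant to perform.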
