Spelling Error Correction with Soft-Masked BERT

Spelling error correction is an important yet challenging task, because a satisfactory solution essentially requires human-level language understanding. Without loss of generality, we consider Chinese spelling error correction (CSC) in this paper. A state-of-the-art method for the task selects a character from a list of candidates for correction (including non-correction) at each position of the sentence on the basis of BERT, the language representation model. The accuracy of the method can be sub-optimal, however, because BERT does not have sufficient capability to detect whether there is an error at each position, apparently due to the way it is pre-trained with masked language modeling. In this work, we propose a novel neural architecture to address this issue, consisting of a network for error detection and a network for error correction based on BERT, with the former connected to the latter by what we call a soft-masking technique. Our method of using 'Soft-Masked BERT' is general, and it may be employed in other language detection-correction problems. Experimental results on two datasets demonstrate that the performance of our proposed method is significantly better than that of the baselines, including the one solely based on BERT.
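To make the soft-masking connection concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the class and parameter names, the default sizes (`vocab_size`, `hidden`), and `mask_token_id` are illustrative assumptions. The detection network (a bidirectional GRU, as in the paper) produces a per-character error probability p_i, and the embedding fed to the correction network is the interpolation p_i * e_[MASK] + (1 - p_i) * e_i, so characters judged likely to be erroneous are "softly" replaced by the [MASK] embedding.

```python
# Minimal sketch of the soft-masking connection (illustrative only; names,
# dimensions, and defaults are assumptions, not the paper's released code).
import torch
import torch.nn as nn

class SoftMaskedConnection(nn.Module):
    def __init__(self, vocab_size=21128, hidden=768, mask_token_id=103):
        # vocab_size / mask_token_id default to typical Chinese-BERT values.
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        # Detection network: Bi-GRU whose concatenated directions match `hidden`.
        self.detector = nn.GRU(hidden, hidden // 2,
                               bidirectional=True, batch_first=True)
        self.det_head = nn.Linear(hidden, 1)
        self.mask_token_id = mask_token_id

    def forward(self, input_ids):
        e = self.embed(input_ids)                       # (B, T, H) input embeddings
        h, _ = self.detector(e)                         # (B, T, H) detector states
        p = torch.sigmoid(self.det_head(h))             # (B, T, 1) error probabilities
        e_mask = self.embed.weight[self.mask_token_id]  # (H,) embedding of [MASK]
        # Soft-masking: interpolate between the original character embedding
        # and the [MASK] embedding, weighted by the detected error probability.
        e_soft = p * e_mask + (1.0 - p) * e             # (B, T, H)
        return e_soft, p.squeeze(-1)
```

In the full model, `e_soft` would be passed to the BERT-based correction network, whose output feeds a softmax over the vocabulary; training combines a detection loss on the error probabilities with a correction loss on the predicted characters.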
