Spelling Error Correction with Soft-Masked BERT

Spelling error correction is an important yet challenging task, because a satisfactory solution essentially requires human-level language understanding. Without loss of generality, we consider Chinese spelling error correction (CSC) in this paper. A state-of-the-art method for the task selects a character from a list of candidates for correction (including non-correction) at each position of the sentence on the basis of BERT, the language representation model. The accuracy of the method can be sub-optimal, however, because BERT does not have sufficient capability to detect whether there is an error at each position, apparently due to the way it is pre-trained with masked language modeling. In this work, we propose a novel neural architecture to address this issue, consisting of a network for error detection and a network for error correction based on BERT, with the former connected to the latter by what we call a soft-masking technique. Our method of using 'Soft-Masked BERT' is general, and it may be employed in other language detection-correction problems. Experimental results on two datasets demonstrate that the performance of our proposed method is significantly better than that of the baselines, including the one solely based on BERT.
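To make the soft-masking connection concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the class and parameter names, the default sizes (`vocab_size`, `hidden`), and `mask_token_id` are illustrative assumptions. The detection network (a bidirectional GRU, as in the paper) produces a per-character error probability p_i, and the embedding fed to the correction network is the interpolation p_i * e_[MASK] + (1 - p_i) * e_i, so characters judged likely to be erroneous are "softly" replaced by the [MASK] embedding.

```python
# Minimal sketch of the soft-masking connection (illustrative only; names,
# dimensions, and defaults are assumptions, not the paper's released code).
import torch
import torch.nn as nn

class SoftMaskedConnection(nn.Module):
    def __init__(self, vocab_size=21128, hidden=768, mask_token_id=103):
        # vocab_size / mask_token_id default to typical Chinese-BERT values.
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        # Detection network: Bi-GRU whose concatenated directions match `hidden`.
        self.detector = nn.GRU(hidden, hidden // 2,
                               bidirectional=True, batch_first=True)
        self.det_head = nn.Linear(hidden, 1)
        self.mask_token_id = mask_token_id

    def forward(self, input_ids):
        e = self.embed(input_ids)                       # (B, T, H) input embeddings
        h, _ = self.detector(e)                         # (B, T, H) detector states
        p = torch.sigmoid(self.det_head(h))             # (B, T, 1) error probabilities
        e_mask = self.embed.weight[self.mask_token_id]  # (H,) embedding of [MASK]
        # Soft-masking: interpolate between the original character embedding
        # and the [MASK] embedding, weighted by the detected error probability.
        e_soft = p * e_mask + (1.0 - p) * e             # (B, T, H)
        return e_soft, p.squeeze(-1)
```

In the full model, `e_soft` would be passed to the BERT-based correction network, whose output feeds a softmax over the vocabulary; training combines a detection loss on the error probabilities with a correction loss on the predicted characters.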
