SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check

Chinese Spelling Check (CSC) aims to detect and correct Chinese spelling errors. Many models rely on a predefined confusion set to learn a mapping between correct characters and their visually or phonetically similar misuses, but such a mapping may be out-of-domain. To address this, we propose SpellBERT, a pretrained model with graph-based extra features that does not depend on a confusion set. To explicitly capture the two erroneous patterns, we employ a graph neural network to introduce radical and pinyin information as visual and phonetic features. To better fuse these features with character representations, we devise pre-training tasks similar to masked language modeling. With this feature-rich pre-training, SpellBERT, at only half the size of BERT, achieves competitive performance and a state-of-the-art result on the OCR dataset, where most of the errors are not covered by the existing confusion set.
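The abstract describes fusing radical (visual) and pinyin (phonetic) node features into character representations via a graph neural network. The sketch below is a minimal, hypothetical illustration of that idea using a single relational-GCN-style update for one character node; all names, dimensions, and weights are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical sketch: one relational message-passing step that fuses
# a character node's representation with its radical and pinyin
# neighbors, in the spirit of SpellBERT's graph-based features.
rng = np.random.default_rng(0)
dim = 8  # illustrative hidden size

# Node features: the character, its radical, and its pinyin syllable.
char_h = rng.standard_normal(dim)
radical_h = rng.standard_normal(dim)   # visual feature source
pinyin_h = rng.standard_normal(dim)    # phonetic feature source

# Relation-specific weights (char<-radical, char<-pinyin, self-loop),
# as in an R-GCN, where each edge type has its own transform.
W_rad = rng.standard_normal((dim, dim)) * 0.1
W_pin = rng.standard_normal((dim, dim)) * 0.1
W_self = np.eye(dim)

def rgcn_update(char_h, radical_h, pinyin_h):
    """Aggregate relation-typed messages into the character node."""
    msg = W_self @ char_h + W_rad @ radical_h + W_pin @ pinyin_h
    return np.maximum(msg, 0.0)  # ReLU nonlinearity

fused = rgcn_update(char_h, radical_h, pinyin_h)
```

The fused vector could then stand in for (or be added to) the character embedding fed to the masked-language-model-style pre-training tasks.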
