FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm

We propose FASPell, a Chinese spell checker based on a new paradigm that consists of a denoising autoencoder (DAE) and a decoder. Compared with previous state-of-the-art models, the new paradigm allows our spell checker to be Faster in computation, readily Adaptable to both simplified and traditional Chinese texts produced by either humans or machines, and Simpler in structure while remaining equally Powerful in both error detection and correction. These four achievements are made possible because the new paradigm circumvents two bottlenecks. First, the DAE curtails the amount of Chinese spell checking data needed for supervised learning (to <10k sentences) by leveraging the power of masked language models pre-trained without supervision, as in BERT, XLNet, MASS, etc. Second, the decoder eliminates the need for a confusion set, which is inflexible and insufficient for exploiting the salient feature of Chinese character similarity.
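The division of labor described above can be illustrated with a minimal toy sketch: a masked language model (the DAE) proposes candidate characters with confidence scores for a suspicious position, and the decoder re-ranks them by combining confidence with a graded similarity to the original character, rather than filtering through a fixed confusion set. The candidate confidences, similarity scores, and the `decode` helper below are all hypothetical illustrations, not the paper's actual model or weighting scheme.

```python
def decode(original, candidates, similarity, weight=0.5):
    """Pick a correction for `original` by a weighted sum of the masked-LM
    confidence and the character-similarity score (both hypothetical here)."""
    best_char, best_score = original, float("-inf")
    for char, confidence in candidates.items():
        score = (1 - weight) * confidence + weight * similarity.get((original, char), 0.0)
        if score > best_score:
            best_char, best_score = char, score
    return best_char

# Toy example: the variant form 囯 appears where 国 is intended.
# Confidences stand in for a masked LM's output at that position;
# similarity scores stand in for visual/phonological character similarity.
candidates = {"国": 0.7, "围": 0.2, "囯": 0.05}
similarity = {("囯", "国"): 0.9, ("囯", "围"): 0.7, ("囯", "囯"): 1.0}
print(decode("囯", candidates, similarity))  # → 国
```

Because similarity enters the score continuously, a character outside any precompiled confusion set can still be selected when the language model is confident enough, which is the flexibility the confusion-set approach lacks.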

[1] Hsin-Hsi Chen et al. Introduction to SIGHAN 2015 Bake-off for Chinese Spelling Check, 2015, SIGHAN@IJCNLP.

[2] Jui-Feng Yeh et al. Chinese Word Spelling Correction Based on N-gram Ranked Inverted Index List, 2013, SIGHAN@IJCNLP.

[3] Jing Li et al. A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check, 2018, EMNLP.

[4] Xu Tan et al. MASS: Masked Sequence to Sequence Pre-training for Language Generation, 2019, ICML.

[5] Deng Cai et al. A Hybrid Model for Chinese Spelling Check, 2017, ACM Trans. Asian Low Resour. Lang. Inf. Process.

[6] Yoshua Bengio et al. Extracting and composing robust features with denoising autoencoders, 2008, ICML '08.

[7] Xiang Bai et al. An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8] Hai Zhao et al. Spell Checking for Chinese, 2012, LREC.

[9] Nikolaus Augsten et al. Tree edit distance: Robust and memory-efficient, 2016, Inf. Syst.

[10] David Eppstein et al. Finding the k Shortest Paths, 1999, SIAM J. Comput.

[11] Lung-Hao Lee et al. Chinese Spelling Check Evaluation at SIGHAN Bake-off 2013, 2013, SIGHAN@IJCNLP.

[12] Yiming Yang et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.

[13] Chao-Lin Liu et al. Visually and Phonologically Similar Characters in Incorrect Simplified Chinese Words, 2010, COLING.

[14] Kam-Fai Wong et al. NLPTEA 2017 Shared Task - Chinese Spelling Check, 2017, NLP-TEA@IJCNLP.

[15] Jianpeng Hou et al. HANSpeller: A Unified Framework for Chinese Spelling Correction, 2015, ROCLING/IJCLCLP.

[16] Yuen-Hsien Tseng et al. Overview of SIGHAN 2014 Bake-off for Chinese Spelling Check, 2014, CIPS-SIGHAN.

[17] Zhenghua Li et al. Chinese Spelling Error Detection and Correction Based on Language Model, Pronunciation, and Shape, 2014, CIPS-SIGHAN.

[18] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[19] Xiang Tong et al. A Statistical Approach to Automatic OCR Error Correction in Context, 1996, VLC@COLING.

[20] Nikolaus Augsten et al. Efficient Computation of the Tree Edit Distance, 2015, TODS.