Integrating Dictionary and Web N-grams for Chinese Spell Checking

Chinese spell checking is an important component of many NLP applications, including word processors, search engines, and automatic essay rating. However, Chinese spell checkers are more difficult to develop than spell checkers for alphabetic languages (e.g., English or French), because the Chinese writing system has no word boundaries and errors may be introduced by a variety of Chinese input methods. In this paper, we propose a novel method for detecting and correcting Chinese typographical errors. Our approach combines word segmentation, detection rules, and phrase-based machine translation. The error detection module segments the text into words and checks word and phrase frequencies against compiled and Web corpora. The phonological or morphological typographical errors it finds are then corrected by a decoder based on a statistical machine translation (SMT) model. The results show that the proposed system achieves significantly better accuracy in error detection, and more satisfactory performance in error correction, than state-of-the-art systems.
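The detection idea described above (segment the text, then check the frequency of the pieces that segmentation leaves unexplained) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy `DICTIONARY`, `NGRAM_COUNTS`, and `CONFUSION` tables, the greedy maximum-matching segmenter, and the function names `segment`, `detect`, and `suggest`. In particular, `suggest` ranks candidates by a simple frequency lookup, which stands in for the SMT decoder the paper uses for correction.

```python
from typing import Dict, List, Set, Tuple

# Toy resources, all hypothetical. A real system would use a full dictionary,
# Web-scale n-gram counts, and large sets of similar-sounding or
# similar-looking characters.
DICTIONARY: Set[str] = {"今天", "天氣", "很", "好"}
NGRAM_COUNTS: Dict[str, int] = {"天氣": 5000}
CONFUSION: Dict[str, List[str]] = {"汽": ["氣", "器"]}


def segment(text: str, max_len: int = 4) -> List[str]:
    """Greedy forward maximum matching; unknown characters become 1-char tokens."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in DICTIONARY:
                tokens.append(piece)
                i += size
                break
    return tokens


def detect(text: str, min_count: int = 1) -> List[Tuple[int, str]]:
    """Flag runs of two or more out-of-dictionary single characters
    whose joined form is rarer than min_count in the n-gram table."""
    suspects: List[Tuple[int, str]] = []
    offset, run_start, run = 0, 0, ""
    for tok in segment(text) + [""]:  # empty sentinel flushes the last run
        if len(tok) == 1 and tok not in DICTIONARY:
            if not run:
                run_start = offset
            run += tok
        else:
            if len(run) > 1 and NGRAM_COUNTS.get(run, 0) < min_count:
                suspects.append((run_start, run))
            run = ""
        offset += len(tok)
    return suspects


def suggest(span: str) -> List[str]:
    """Substitute confusable characters and keep dictionary words,
    most frequent first (a stand-in for SMT decoding)."""
    scored: List[Tuple[int, str]] = []
    for i, ch in enumerate(span):
        for alt in CONFUSION.get(ch, []):
            cand = span[:i] + alt + span[i + 1:]
            if cand in DICTIONARY:
                scored.append((NGRAM_COUNTS.get(cand, 0), cand))
    return [cand for _, cand in sorted(scored, reverse=True)]


if __name__ == "__main__":
    for start, span in detect("今天天汽很好"):
        print(start, span, suggest(span))
```

With these toy tables, the misspelled 天汽 fails to segment as a word, is absent from the n-gram counts, and is therefore flagged, while the correctly spelled 天氣 segments cleanly and is not.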
