A Study on Chinese Spelling Check Using Confusion Sets and N-gram Statistics

This paper proposes an automatic method for building a Chinese spelling check system. Confusion sets were expanded using two language resources, Shuowen Jiezi and the Four-Corner codes, which improved their coverage. Nine scoring functions were proposed that use frequency data from the Google Ngram Datasets and incorporate smoothing. Thresholds were also determined automatically. The final system performed far better than our baseline system in the CSC 2013 Evaluation Task.
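The general idea of combining confusion sets with smoothed n-gram frequencies can be illustrated with a minimal sketch. It is not one of the paper's nine scoring functions: the confusion sets, bigram counts, the add-alpha smoothing, the function names `smoothed_score` and `correct`, and the fixed threshold are all hypothetical stand-ins chosen for illustration, and the threshold here is set by hand rather than determined automatically.

```python
import math

# Hypothetical toy resources. A real system would load confusion sets built
# from sources such as Shuowen Jiezi and Four-Corner codes, and counts from
# a large n-gram dataset.
CONFUSION_SETS = {"在": {"再"}, "再": {"在"}}
BIGRAM_COUNTS = {"再見": 120, "我在": 80, "在家": 95, "我再": 3}


def smoothed_score(text, i, alpha=1.0):
    """Sum log(count + alpha) over the character bigrams covering position i.

    Adding alpha is a simple smoothing step so that unseen bigrams
    contribute a finite score instead of log(0).
    """
    score = 0.0
    for start in (i - 1, i):
        if start >= 0 and start + 2 <= len(text):
            bigram = text[start:start + 2]
            score += math.log(BIGRAM_COUNTS.get(bigram, 0) + alpha)
    return score


def correct(text, threshold=1.0):
    """Replace a character with a confusable one when the smoothed score
    improves by more than the (assumed, hand-picked) threshold."""
    chars = list(text)
    for i, ch in enumerate(chars):
        original_score = smoothed_score("".join(chars), i)
        best_char, best_score = ch, original_score
        for cand in sorted(CONFUSION_SETS.get(ch, ())):
            trial = "".join(chars[:i] + [cand] + chars[i + 1:])
            cand_score = smoothed_score(trial, i)
            if cand_score > best_score:
                best_char, best_score = cand, cand_score
        if best_score - original_score > threshold:
            chars[i] = best_char
    return "".join(chars)


if __name__ == "__main__":
    print(correct("在見"))
    print(correct("我在家"))
```

With these toy counts, `correct("在見")` substitutes 再 for 在 because the bigram 再見 is far more frequent, while `correct("我在家")` is left unchanged because no confusable character scores higher than the original. The threshold guards against false alarms when the score gain is marginal.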
