Paper Information - Improvement of Korean Proofreading System Using Corpus and Collocation Rules

This paper presents techniques for correcting spelling errors, orthographic errors, and grammatical errors in computer-based text. It extends the usual checking of isolated single words to multiword units and whole sentences. Candidate words for spelling errors are generated by applying rule functions and a correction rule table that contains heuristic collocation information. To prevent excessive candidate generation and to improve accuracy, we use a high-frequency word dictionary of 300,000 words derived from a corpus. For constituent errors, grammar-based partial parsing rules are applied to find collocation errors between words. We tested the correction techniques on corpora that are the final result of SERI's research, texts, newspaper materials, and public materials. The system achieves a 98% accuracy rate when the 8.5% of errors caused by unregistered words are excluded. The average number of candidates suggested by the system is 1.12.
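The candidate generation step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the rule table `RULES`, the dictionary `FREQ_DICT`, the frequency values, and the function names `generate_candidates` and `suggest` are all hypothetical, and the rules here are plain substring substitutions with no collocation information. The sketch only shows how filtering against a high-frequency dictionary keeps the number of suggested candidates small.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical correction rule table: (erroneous substring, proposed replacement).
RULES: List[Tuple[str, str]] = [
    ("됬", "됐"),
    ("됬", "됫"),
    ("않", "안"),
]

# Hypothetical high-frequency dictionary: word -> corpus frequency (made-up counts).
FREQ_DICT: Dict[str, int] = {"됐다": 520, "안": 910, "않다": 300}


def generate_candidates(word: str, rules: List[Tuple[str, str]]) -> Set[str]:
    """Apply every rule at every matching position, one substitution at a time."""
    candidates: Set[str] = set()
    for wrong, right in rules:
        start = word.find(wrong)
        while start != -1:
            candidates.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    return candidates


def suggest(word: str, rules: List[Tuple[str, str]], freq_dict: Dict[str, int]) -> List[str]:
    """Return dictionary-attested candidates, most frequent first."""
    if word in freq_dict:
        return []  # the word is registered, so it is treated as correct
    kept = [c for c in generate_candidates(word, rules) if c in freq_dict]
    return sorted(kept, key=lambda c: (-freq_dict[c], c))


print(suggest("됬다", RULES, FREQ_DICT))
```

In this toy setup the rules produce two candidates for the misspelling, and the dictionary filter discards the one that is not an attested word. A correct word that is missing from the dictionary would still be flagged, which corresponds to the unregistered-word errors mentioned in the abstract.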

Young-Soog Chae
