Improvement of Korean Proofreading System Using Corpus and Collocation Rules

This paper presents techniques for correcting spelling, orthographic, and grammatical errors in computer-based text. It extends conventional checking of isolated single words to multiword units and whole sentences. Candidate words for spelling errors are generated by applying rule functions and a correction rule table that contains heuristic collocation information. To prevent excessive candidate generation and to improve accuracy, we use a high-frequency word dictionary of 300,000 words derived from a corpus. For constituent errors, grammar-based partial parsing rules are applied to detect collocation errors between words. We evaluate the correction techniques on corpora drawn from the final results of SERI's research, texts, newspaper materials, and public materials. The system achieves a 98% accuracy rate when the 8.5% of errors caused by unregistered words are excluded. The average number of candidates suggested by the system is 1.12.
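The candidate-generation step described above can be illustrated with a minimal sketch. It is not the system's implementation: the rule table, the dictionary entries and their counts, and the function name `generate_candidates` are all hypothetical, chosen only to show how a frequency dictionary might prune candidates produced by correction rules.

```python
from typing import Dict, List, Tuple

# Hypothetical correction rules: (erroneous substring, replacement).
# The second rule deliberately over-generates to show the pruning step.
CORRECTION_RULES: List[Tuple[str, str]] = [
    ("됬", "됐"),
    ("됬", "됫"),
    ("몇일", "며칠"),
]

# Hypothetical stand-in for the high-frequency word dictionary
# (word -> corpus frequency); the counts are invented for illustration.
HIGH_FREQ: Dict[str, int] = {"됐다": 1200, "며칠": 800, "했다": 5000}


def generate_candidates(
    word: str,
    rules: List[Tuple[str, str]] = CORRECTION_RULES,
    lexicon: Dict[str, int] = HIGH_FREQ,
) -> List[str]:
    """Return correction candidates for `word`, most frequent first."""
    # A word already in the dictionary is treated as correct.
    if word in lexicon:
        return []
    # Apply every rule whose left-hand side occurs in the word.
    raw = {word.replace(wrong, right) for wrong, right in rules if wrong in word}
    # Keep only candidates attested in the dictionary, which limits
    # the number of suggestions shown to the user.
    kept = [c for c in raw if c in lexicon]
    return sorted(kept, key=lambda c: -lexicon[c])


if __name__ == "__main__":
    for w in ["됬다", "몇일", "했다"]:
        print(w, generate_candidates(w))
```

In this sketch, filtering against the dictionary is what keeps the number of suggestions per error small; a word that matches no rule, or whose rewritten forms are all unattested, yields no suggestion, which loosely corresponds to the unregistered-word case mentioned above.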