Combining Trigram and Winnow in Thai OCR Error Correction

Because of specific characteristics of the Thai script, Thai OCR errors frequently depend on nearby characters. To capture this dependence, we propose scoring approximately matched words with a variable-length n-gram model of character confusion probabilities, in which the value of n depends on the characteristics of each character. For languages that have no explicit word boundaries, word boundary ambiguity has to be resolved before errors can be corrected. In this paper, a maximal matching algorithm is used instead of a more complicated word segmentation algorithm in order to reduce time complexity. Finally, a hybrid method that combines a part-of-speech trigram model with the Winnow algorithm is used to select the most probable correction.
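As an illustration of the word-boundary step, the following is a minimal sketch of dictionary-based maximal matching, assuming the common formulation in which the segmentation with the fewest dictionary words is preferred. The function name `maximal_matching`, the toy Latin-letter lexicon, and the tie-breaking behaviour are hypothetical and chosen only for illustration; they are not taken from the paper, and the system described here may differ in detail.

```python
from typing import List, Optional, Set


def maximal_matching(text: str, lexicon: Set[str]) -> Optional[List[str]]:
    """Split text into the fewest lexicon words (illustrative sketch).

    Returns None when the text cannot be fully covered by lexicon words.
    Ties between equally short segmentations are broken arbitrarily.
    """
    n = len(text)
    max_len = max((len(w) for w in lexicon), default=0)

    # best[i]: minimum number of words covering text[:i], None if unreachable.
    best: List[Optional[int]] = [None] * (n + 1)
    # back[i]: start index of the last word in the best cover of text[:i].
    back: List[int] = [-1] * (n + 1)
    best[0] = 0

    for end in range(1, n + 1):
        # Only look back as far as the longest word in the lexicon.
        for start in range(max(0, end - max_len), end):
            if best[start] is None or text[start:end] not in lexicon:
                continue
            candidate = best[start] + 1
            if best[end] is None or candidate < best[end]:
                best[end] = candidate
                back[end] = start

    if best[n] is None:
        return None

    # Follow the back-pointers from the end to recover the words.
    words: List[str] = []
    pos = n
    while pos > 0:
        start = back[pos]
        words.append(text[start:pos])
        pos = start
    words.reverse()
    return words


# Toy example with a hypothetical lexicon (Latin letters for readability).
toy_lexicon = {"a", "b", "c", "d", "ab", "cd"}
print(maximal_matching("abcd", toy_lexicon))  # ['ab', 'cd']
```

The dynamic program examines at most `len(text) * max_len` substrings, which is consistent with the stated motivation of keeping segmentation cheap compared with a statistical word segmenter.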