Post-Processing Approach for Printed Chinese Character Recognition
In Chinese OCR post-processing, using high-order Chinese n-gram language models, such as word-based trigram and four-gram models, remains challenging because of data sparseness and the large memory cost that comes with a big model. In this paper, we focus on the post-processing of printed Chinese character recognition and propose a byte-based language model. By choosing the byte as the representation unit of the language model, we achieve a remarkable reduction in model size, which alleviates the sparseness problem to a great extent. The experimental results show that the new byte-based language model performs very well, with improved accuracy and very low time and space costs. On the test set with segmentation errors, the recognition accuracy increases from 88.67% to 98.32%, an error reduction of 85.18%. Compared with the system using a traditional word-based trigram model, the new system saves 95% of the time cost and nearly 98% of the memory cost, with almost no loss in accuracy.
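The idea of a byte-level language model for rescoring OCR candidates can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not state the text encoding, the n-gram order, or the smoothing method, so the GB2312 encoding (two bytes per Chinese character), the order n = 3, the add-one smoothing, and the names `ByteNGramModel`, `byte_ngrams`, and `pick_best` are all assumptions made for illustration.

```python
import math
from collections import Counter


def byte_ngrams(text, n, encoding="gb2312"):
    """Return the byte n-grams of `text`, one per encoded byte.

    The encoding is an assumption; the sequence is left-padded with
    zero bytes so that the first bytes also have a full context.
    """
    data = text.encode(encoding)
    padded = b"\x00" * (n - 1) + data
    return [padded[i:i + n] for i in range(len(data))]


class ByteNGramModel:
    """Hypothetical byte n-gram model with add-one smoothing."""

    def __init__(self, n=3, encoding="gb2312"):
        self.n = n
        self.encoding = encoding
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of (n-1)-byte contexts

    def train(self, corpus):
        for line in corpus:
            for gram in byte_ngrams(line, self.n, self.encoding):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def log_prob(self, text):
        # The vocabulary is fixed at 256 byte values, so the smoothing
        # denominator stays small no matter how large the lexicon is.
        total = 0.0
        for gram in byte_ngrams(text, self.n, self.encoding):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + 256
            total += math.log(numerator / denominator)
        return total


def pick_best(model, candidates):
    """Choose the candidate string with the highest model score."""
    return max(candidates, key=model.log_prob)


# Toy usage: the recognizer proposes two readings that differ in one
# visually similar character; the model prefers the one seen in training.
model = ByteNGramModel(n=3)
model.train(["中国人民", "中国人民银行", "人民"])
best = pick_best(model, ["中国人民", "中国入民"])
print(best)
```

Because the unit is a byte, the model never needs a word segmenter or a word vocabulary, which is one plausible reading of why such a model is small and tolerant of segmentation errors; a real system would combine this score with the recognizer's own candidate confidences.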