Paper information - Performance evaluation of two Arabic OCR products

Performance evaluation of two Arabic OCR products

Numerous Optical Character Recognition (OCR) companies claim that their products have near-perfect recognition accuracy (close to 99.9%). In practice, however, these accuracy rates are rarely achieved. Most systems break down when the input document images are highly degraded, such as scanned images of carbon-copy documents, documents printed on low-quality paper, and documents that are n-th generation photocopies. Moreover, the end user cannot compare the relative performance of the products because the various accuracy results are not reported on the same dataset. In this article, we report our evaluation results for two popular Arabic OCR products: (1) Sakhr OCR and (2) OmniPage for Arabic. In our evaluation, we establish that the Sakhr OCR product has a 15.47% lower page error rate relative to the OmniPage page error rate. The absolute page accuracy rates for Sakhr and OmniPage are 90.33% and 86.89%, respectively. Our evaluation was performed using the SAIC Arabic image dataset, and we used only those pages for which both OCR systems produced output. A scatter plot of the page accuracy-rate pairs reveals that Sakhr in general performs better on low-accuracy (degraded) pages. The scatter-plot visualization technique allows an algorithm developer to easily detect and analyze outliers in the results.

Tapas Kanungo | Gregory Marton | Osama Bulbul

[1] Suresh Subramaniam, et al. Performance evaluation of two OCR systems, 1994.

[2] Henry S. Baird, et al. Document image defect models, 1995.

[3] Luc Vincent, et al. Pink Panther: A Complete Environment For Ground-Truthing And Benchmarking Document Page Segmentation, 1998, Pattern Recognit.

[4] Stephen V. Rice, et al. The Fourth Annual Test of OCR Accuracy, 1995.

[5] Luc M. Vincent, et al. Benchmarking page segmentation algorithms, 1994, 1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.

[6] Tapas Kanungo, et al. Document degradation models and a methodology for degradation model validation, 1996.

[7] Philip Resnik, et al. The Bible, truth, and multilingual OCR evaluation, 1999, Electronic Imaging.

[8] Tapas Kanungo, et al. OmniPage vs. Sakhr: paired model evaluation of two Arabic OCR products, 1999, Electronic Imaging.

[9] Robert M. Haralick, et al. An Automatic Closed-Loop Methodology for Generating Character Groundtruth for Scanned Documents, 1999, IEEE Trans. Pattern Anal. Mach. Intell.

[10] John B. Shoven, et al. I, Edinburgh Medical and Surgical Journal.