The Bible, truth, and multilingual OCR evaluation

In this paper we propose to use the Bible as a dataset for comparing OCR accuracy across languages. Besides being available in a wide range of languages, Bible translations are closely parallel in content, carefully translated, surprisingly relevant to modern-day language, and quite inexpensive. A project at the University of Maryland is currently implementing this idea. We have created a scanned image dataset with ground truth from an Arabic Bible. We have also used image degradation models to create synthetically degraded images of a French Bible. We hope to generate similar Bible datasets for other languages, and we are exploring alternative corpora with similar properties, such as the Koran and the Bhagavad Gita. Quantitative OCR evaluation based on the Arabic Bible dataset is currently in progress.
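As an illustration of how synthetic degradation of the kind mentioned above can be produced, the sketch below implements a Kanungo-style local degradation model with NumPy and SciPy: pixels are flipped with a probability that decays exponentially with their squared distance to the nearest character boundary, and the result is smoothed by a morphological closing. This is a minimal sketch under standard assumptions about the model; the parameter names (alpha0, alpha, beta0, beta, eta, k) and their defaults are illustrative, not values used in the project.

```python
import numpy as np
from scipy import ndimage


def degrade(binary_img, alpha=1.0, beta=1.0, alpha0=1.0, beta0=1.0,
            eta=0.0, k=3, rng=None):
    """Kanungo-style local degradation of a binary document image.

    binary_img: 2-D boolean array, True = foreground (ink).
    Foreground and background pixels are flipped with probabilities
    that decay exponentially with the squared distance to the nearest
    character boundary; a morphological closing then smooths the result.
    """
    rng = np.random.default_rng(rng)
    fg = binary_img.astype(bool)

    # Distance of each pixel to the character boundary.
    d_fg = ndimage.distance_transform_edt(fg)    # inside strokes
    d_bg = ndimage.distance_transform_edt(~fg)   # in the background

    # Flip probabilities: foreground -> background and background -> foreground.
    p_fg = alpha0 * np.exp(-alpha * d_fg ** 2) + eta
    p_bg = beta0 * np.exp(-beta * d_bg ** 2) + eta

    noise = rng.random(fg.shape)
    out = np.where(fg, noise >= p_fg, noise < p_bg)

    # Morphological closing with a k x k structuring element.
    return ndimage.binary_closing(out, structure=np.ones((k, k), bool))
```

On a binarized page image, a call such as degrade(page, alpha=2.0, beta=2.0, eta=0.01, rng=0) produces a plausibly noisier version of the page while keeping the original ground-truth text valid, which is what makes synthetic degradation attractive for controlled OCR experiments.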
