The Bible, Truth, and Multilingual Optical Character Recognition

As global online access to information becomes more common, multilingual optical character recognition (OCR) grows in importance as a way to convert paper documents into electronic, searchable text. In OCR, as in any evolving technology, careful evaluation is an integral part of research and development. OCR evaluation is done by comparing a system’s output for a dataset of document test images with the corresponding correct symbolic text, known as ground truth. Unfortunately, ground truth is usually obtained through manual data entry, which is labor-intensive, time-consuming, expensive, and prone to errors. Worse, because no single set of “ground truth” evaluation data can be used in more than one language, there has until now been no way to conduct carefully controlled OCR experiments in a multilingual setting. To address this problem, we introduce the Bible as a dataset for evaluating multilingual OCR accuracy. Bible translations are closely parallel in structure, careful to preserve meaning, surprisingly relevant
