Document image ground truth generation from electronic text

The problem of generating synthetic data for the training and evaluation of document analysis systems has been widely addressed. With the increased interest in processing multilingual sources, however, there is a pressing need to rapidly generate data in new languages and scripts without developing specialized systems for each. We have developed an approach that uses the language support of the MS Windows operating system, combined with custom print drivers, to render TIFF images simultaneously with Windows enhanced metafile directives. The metafile information is parsed to generate zone, line, word, and character ground truth, including location, font information, and content, in any language supported by Windows. The resulting images can be physically or synthetically degraded and used for training and evaluating OCR systems. We briefly survey related work and describe our system.
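To make the ground-truth hierarchy concrete, the following is a minimal sketch of how character-level records might be aggregated into word-level and line-level ground truth. It assumes the character text and bounding boxes have already been recovered from the metafile; the `Box` record, the function names, and the sample coordinates are hypothetical illustrations and are not taken from the system described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical ground-truth record: content plus pixel bounding box."""
    text: str
    x: int
    y: int
    w: int
    h: int


def union(boxes: List[Box], text: str) -> Box:
    """Return the smallest box enclosing all given boxes, labeled with text."""
    x0 = min(b.x for b in boxes)
    y0 = min(b.y for b in boxes)
    x1 = max(b.x + b.w for b in boxes)
    y1 = max(b.y + b.h for b in boxes)
    return Box(text, x0, y0, x1 - x0, y1 - y0)


def words_from_chars(chars: List[Box]) -> List[Box]:
    """Group consecutive non-space characters of one text line into words."""
    words: List[Box] = []
    current: List[Box] = []
    for c in chars:
        if c.text.isspace():
            # Whitespace closes the word being built, if any.
            if current:
                words.append(union(current, "".join(b.text for b in current)))
                current = []
        else:
            current.append(c)
    if current:
        words.append(union(current, "".join(b.text for b in current)))
    return words


def line_from_words(words: List[Box]) -> Box:
    """Build the line-level record from its word-level records."""
    return union(words, " ".join(w.text for w in words))


# Assumed sample input: the characters of "ab c" on a single line.
sample_chars = [
    Box("a", 0, 0, 10, 20),
    Box("b", 10, 0, 10, 20),
    Box(" ", 20, 0, 5, 20),
    Box("c", 25, 0, 10, 20),
]
sample_words = words_from_chars(sample_chars)
sample_line = line_from_words(sample_words)
```

Zone-level ground truth could be built the same way by taking the union of the line boxes that belong to one zone. Splitting words on whitespace is a simplification that does not hold for every script.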
