Generating Synthetic Data for Text Analysis Systems

In this paper we describe work on a system for modeling errors in the output of OCR systems. The project is motivated by the need to evaluate the performance of various text analysis systems under varying yet controlled conditions. We describe a set of symbol and page models that are used to degrade an ideal text by introducing errors of the kind that typically occur during the scanning, decomposition, and recognition of document images. We also describe a first generation of the software, which implements the page models and allows the use of transition probabilities, either extracted from real data or generated synthetically, to corrupt text.
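
To make the transition-probability idea concrete, the following is a minimal sketch (not the authors' implementation) of corrupting clean text with a character-level transition table. The table, function names, and probabilities are illustrative assumptions only; in practice the distributions would be estimated from aligned OCR output or generated synthetically, as the abstract describes.

```python
import random

# Toy character-level transition table (illustrative values, not from the paper).
# Each entry maps a source character to candidate outputs and their probabilities;
# the empty string models a deletion, multi-character outputs model splits.
TRANSITIONS = {
    "e": [("e", 0.90), ("c", 0.06), ("o", 0.03), ("", 0.01)],
    "l": [("l", 0.92), ("1", 0.05), ("i", 0.03)],
    "m": [("m", 0.88), ("rn", 0.10), ("nn", 0.02)],
}

def corrupt(text, transitions=TRANSITIONS):
    """Replace each character according to its transition distribution;
    characters without an entry pass through unchanged."""
    out = []
    for ch in text:
        candidates = transitions.get(ch)
        if candidates is None:
            out.append(ch)
            continue
        outputs, weights = zip(*candidates)
        out.append(random.choices(outputs, weights=weights, k=1)[0])
    return "".join(out)

if __name__ == "__main__":
    random.seed(0)
    print(corrupt("example line of clean text"))
```

A real degradation model would also condition on context (neighboring symbols, page position, scan quality), but a per-character table of this form is sufficient to show how empirically derived or synthetic probabilities can drive controlled corruption of an ideal text.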
