How to extend and bootstrap an existing data set with real-life degraded images

This paper introduces a methodology for bootstrapping an existing data set by creating a large number of groundtruthed "real-life" degraded images from it, at a fraction of the original cost and time. The real-life degradations include geometric distortions, coffee stains, water or ink marks, and folds and creases. The methodology includes an automatic procedure that can generate an unlimited number of "real-life" degraded images (with coffee and ink marks and soil spots) at no additional cost. A small experiment was conducted to illustrate the effectiveness of the methodology. In the experiment, 22 real-life degraded images and the two original images were tested on a commercial OCR system. The OCR accuracy rates for the two original pages were 98.46% and 99.34%, while those for the degraded pages ranged from 57.17% to 98.45%, depending on the severity and type of degradation applied to the pages.
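To make the idea of automatically generated stain degradations concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes a grayscale page stored as a NumPy array with values in [0, 1] and darkens a soft-edged circular region to imitate a coffee or ink mark. The function name `add_stain` and all of its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def add_stain(page, center, radius, strength=0.5, seed=0):
    """Darken a roughly circular, soft-edged region of a grayscale page.

    Illustrative only; this is an assumed model, not the paper's method.
    page: 2-D float array in [0, 1], where 1.0 is white paper.
    center: (row, col) of the stain; radius: stain size in pixels.
    strength: fraction of brightness removed at the stain's core.
    """
    rng = np.random.default_rng(seed)
    rows, cols = np.indices(page.shape)
    dist = np.hypot(rows - center[0], cols - center[1])
    # Soft falloff: 1 at the center, 0 at and beyond the radius.
    mask = np.clip(1.0 - dist / radius, 0.0, 1.0)
    # Small random variation so the stain is not perfectly uniform.
    mask = mask * rng.uniform(0.8, 1.0, size=page.shape)
    # Pixels outside the stain (mask == 0) are left unchanged.
    return page * (1.0 - strength * mask)

# Usage: stain a blank 50x50 "page" near its middle.
blank = np.ones((50, 50))
stained = add_stain(blank, center=(25, 25), radius=10)
```

Because an overlay of this kind leaves pixel positions unchanged, the ground truth of the original page would still apply to the stained copy. That is one plausible reading of why such images can be generated without extra labeling effort, though the abstract does not spell this out.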
