GROUNDTRUTH GENERATION AND DOCUMENT IMAGE DEGRADATION

Abstract: The problem of generating synthetic data for the training and evaluation of document analysis systems has been widely addressed in recent years. With the increased interest in processing multilingual sources, however, there is a pressing need to rapidly generate data in new languages and scripts without developing specialized systems. We have developed a system that uses the language support of the MS Windows operating system, combined with custom print drivers, to render TIFF images simultaneously with Windows Enhanced Metafile directives. The metafile information is parsed to generate zone, line, word, and character ground truth, including location, font information, and content, in any language supported by Windows. The resulting images can be physically or synthetically degraded by our degradation modules and used for training and evaluating Optical Character Recognition (OCR) systems. Our document image degradation methodology incorporates several often-encountered types of noise at the page and pixel levels. Examples of OCR evaluation and of synthetically degraded document images are given to demonstrate the effectiveness of the approach.
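
To make the idea of pixel-level synthetic degradation concrete, the sketch below flips each pixel of a binary image independently with a fixed probability. It is only an illustration under stated assumptions, not the degradation modules described in the abstract: the function names `flip_pixels` and `changed_fraction`, the list-of-rows image representation, and the uniform flip probability are all hypothetical choices made for this example.

```python
import random


def flip_pixels(image, flip_prob, seed=None):
    """Return a degraded copy of a binary image (list of rows of 0/1 values).

    Each pixel is flipped independently with probability flip_prob.
    This uniform noise model is an assumption chosen for illustration.
    """
    rng = random.Random(seed)
    return [
        [1 - px if rng.random() < flip_prob else px for px in row]
        for row in image
    ]


def changed_fraction(original, degraded):
    """Fraction of pixels that differ between two same-sized binary images."""
    total = sum(len(row) for row in original)
    changed = sum(
        a != b
        for row_o, row_d in zip(original, degraded)
        for a, b in zip(row_o, row_d)
    )
    return changed / total


if __name__ == "__main__":
    # A tiny 4x6 "page": 1 is ink, 0 is background.
    page = [
        [0, 1, 1, 1, 1, 0],
        [0, 1, 0, 0, 1, 0],
        [0, 1, 0, 0, 1, 0],
        [0, 1, 1, 1, 1, 0],
    ]
    noisy = flip_pixels(page, flip_prob=0.1, seed=0)
    print("fraction of pixels changed:", changed_fraction(page, noisy))
```

Because the clean image and its ground truth are kept alongside the degraded copy, the same ground truth can be reused to score an OCR system at each noise level.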
