Semi-synthetic Document Image Generation Using Texture Mapping on Scanned 3D Document Shapes

This article presents a method for generating semi-synthetic images of old documents whose pages may be torn or not flat. Because they rely only on 2D deformation models, most existing methods produce unrealistic synthetic document images. We therefore propose a 3D approach for reproducing the geometric distortions found in real documents. First, a newly proposed texture coordinate generation technique computes the texture coordinates of each vertex in the document shape (mesh) obtained by 3D scanning a real degraded document. Then, any 2D document image can be overlaid on the mesh using an existing texture mapping method. As a result, many complex real geometric distortions can be integrated into the generated synthetic images. These images can then be used to enrich training sets or for performance evaluation. The degradation method presented here was used jointly with the character degradation model we proposed in [1] to generate the 6000 semi-synthetic degraded images of the ICDAR 2013 music score staff line removal competition.
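To make the two-step pipeline concrete, the following is a minimal sketch of assigning a texture coordinate to each mesh vertex and then looking up a 2D image at that coordinate. It is not the texture coordinate generation technique proposed in the article, whose details are not given here. It assumes a simple planar projection of the vertices onto the XY plane and a nearest-neighbour lookup, and the names `planar_uv` and `sample_texture` are hypothetical.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]  # (x, y, z) position of a mesh vertex
UV = Tuple[float, float]             # texture coordinate in [0, 1] x [0, 1]


def planar_uv(vertices: List[Vertex]) -> List[UV]:
    """Assign a UV pair to each vertex by planar projection (an assumption).

    The z coordinate is ignored and the XY bounding box of the mesh is
    rescaled to the unit square.
    """
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    min_x, min_y = min(xs), min(ys)
    # Fall back to 1.0 to avoid dividing by zero on a degenerate mesh.
    span_x = (max(xs) - min_x) or 1.0
    span_y = (max(ys) - min_y) or 1.0
    return [((x - min_x) / span_x, (y - min_y) / span_y)
            for x, y, _z in vertices]


def sample_texture(image: List[List[int]], uv: UV) -> int:
    """Return the pixel of a row-major grayscale image nearest to uv."""
    u, v = uv
    height, width = len(image), len(image[0])
    # Clamp so that u == 1.0 or v == 1.0 maps to the last column or row.
    col = min(int(u * width), width - 1)
    row = min(int(v * height), height - 1)
    return image[row][col]


# A slightly warped four-vertex "page" and a 2 x 2 grayscale image.
vertices = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.5), (2.0, 4.0, 0.1), (0.0, 4.0, 0.0)]
image = [[10, 20],
         [30, 40]]

uvs = planar_uv(vertices)
colors = [sample_texture(image, uv) for uv in uvs]
```

A planar projection ignores how much the surface stretches along folds and curls, so it distorts the texture wherever the page is far from flat. A technique designed for scanned document shapes would presumably need to account for the surface geometry when assigning coordinates.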

[1]  Daniel P. Lopresti, et al. Validation of Image Defect Models for Optical Character Recognition, 1996, IEEE Trans. Pattern Anal. Mach. Intell.

[2]  Tony P. Pridmore, et al. Generation of synthetic documents for performance evaluation of symbol recognition & spotting systems, 2010, International Journal on Document Analysis and Recognition (IJDAR).

[3]  Elisa H. Barney Smith. Modeling image degradations for improving OCR, 2008, 2008 16th European Signal Processing Conference.

[4]  Christoph H. Lampert, et al. Document capture using stereo vision, 2004, DocEng '04.

[5]  Tamás Varga, et al. Effects of Training Set Expansion in Handwriting Recognition Using Synthetic Data, 2003.

[6]  Muriel Visani, et al. A character degradation model for grayscale ancient document images, 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[7]  Horst Bunke, et al. Generation of synthetic training data for an HMM-based handwriting recognition system, 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings.

[8]  Bui Tuong Phong. Illumination for computer generated pictures, 1975, Commun. ACM.

[9]  David S. Doermann, et al. Geometric Rectification of Camera-Captured Document Images, 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Robert M. Haralick, et al. Global and local document degradation models, 1993, Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR '93).

[11]  Dov Dori, et al. A line drawings degradation model for performance characterization, 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings.

[12]  Robert M. Haralick, et al. A Statistical, Nonparametric Methodology for Document Degradation Model Validation, 2000, IEEE Trans. Pattern Anal. Mach. Intell.

[13]  Alicia Fornés, et al. The ICDAR 2011 Music Scores Competition: Staff Removal and Writer Identification, 2011, 2011 International Conference on Document Analysis and Recognition.

[14]  R. Loce, et al. Halftone banding due to vibrations in a xerographic image bar printer, 1990.

[15]  Henry S. Baird, et al. Document image defect models, 1995.

[16]  Henry S. Baird, et al. The State of the Art of Document Image Degradation Modelling, 2007.

[17]  Minoru Mori, et al. Generating New Samples from Handwritten Numerals Based on Point Correspondence, 2004.

[18]  Kazuhiko Yamamoto, et al. Structured Document Image Analysis, 1992, Springer Berlin Heidelberg.