A character degradation model for grayscale ancient document images

The Kanungo noise model is widely used to test the robustness of binary document image analysis methods to noise. However, it only works with binary images, while most document images are in grayscale. Because binarizing a document image can degrade its contents and lose information, a growing number of researchers are focusing on segmentation-free methods (Garz et al. [6]). We therefore propose a local noise model for grayscale images. Its main principle is to degrade the image locally in the neighbourhoods of “seed-points” selected close to the character boundaries. These points define the centers of “noise regions”. The pixel values inside each noise region are modified using a Gaussian random distribution to make the final result more realistic. While the Kanungo model simulates scanning artifacts, our model simulates degradations due to the age of the document itself and to the printing or writing process, such as ink splotches, white specks or streaks. The model is easy to parameterize, so users can create a set of benchmark databases with increasing levels of noise. These databases can then be used to test the robustness of grayscale document image analysis methods (e.g. text line segmentation, OCR, handwriting recognition).
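The principle described above (seed-points near character boundaries, a noise region around each seed, Gaussian modification of the pixel values inside it) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `degrade_locally`, its parameters, the fixed ink threshold used to locate boundaries, the circular region shape and the additive zero-mean Gaussian noise are all assumptions made for illustration.

```python
import numpy as np

def degrade_locally(image, n_seeds=20, radius=4, sigma=25.0,
                    ink_threshold=128, rng=None):
    """Hypothetical sketch: perturb a grayscale image near character boundaries."""
    rng = np.random.default_rng(rng)
    img = image.astype(np.float64)

    # Assumption: dark ink on a light background, separated by a fixed threshold.
    ink = img < ink_threshold

    # Boundary pixels: ink pixels with at least one non-ink 4-neighbour.
    padded = np.pad(ink, 1, mode="edge")
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    boundary = ink & ~interior

    ys, xs = np.nonzero(boundary)
    if len(ys) == 0:
        return image.copy()

    # Select seed-points at random among the boundary pixels.
    picks = rng.choice(len(ys), size=min(n_seeds, len(ys)), replace=False)

    yy, xx = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    for cy, cx in zip(ys[picks], xs[picks]):
        # Assumption: each noise region is a disc centred on the seed-point.
        region = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
        # Assumption: additive zero-mean Gaussian noise inside the region.
        img[region] += rng.normal(0.0, sigma, size=int(region.sum()))

    return np.clip(img, 0, 255).astype(np.uint8)

# Synthetic "character": a dark square on a white page.
page = np.full((64, 64), 255, dtype=np.uint8)
page[20:40, 20:40] = 0

# A series with an increasing level of noise, here by raising the seed count.
levels = [degrade_locally(page, n_seeds=n, rng=0) for n in (5, 20, 60)]
```

In this sketch the noise level is driven by `n_seeds`, `radius` and `sigma`; which parameters the actual model exposes is not stated in the abstract.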

[1] Robert M. Haralick, et al. A Statistical, Nonparametric Methodology for Document Degradation Model Validation, 2000, IEEE Trans. Pattern Anal. Mach. Intell.

[2] Nikos Papamarkos, et al. An Evaluation Technique for Binarization Algorithms, 2008, J. Univers. Comput. Sci.

[3] Apostolos Antonacopoulos, et al. Ground Truth for Layout Analysis Performance Evaluation, 2006, Document Analysis Systems.

[4] Muriel Visani, et al. A protocol to characterize the descriptive power and the complementarity of shape descriptors, 2010, International Journal on Document Analysis and Recognition (IJDAR).

[5] R. Loce, et al. Halftone banding due to vibrations in a xerographic image bar printer, 1990.

[6] Angelika Garz, et al. Binarization-Free Text Line Segmentation for Historical Documents Based on Interest Point Clustering, 2012, 2012 10th IAPR International Workshop on Document Analysis Systems.

[7] Tony P. Pridmore, et al. Building Synthetic Graphical Documents for Performance Evaluation, 2007, GREC.

[8] Robert M. Haralick, et al. Global and local document degradation models, 1993, Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR '93).

[9] Sébastien Adam, et al. Automatic Ground-truth Generation for Document Image Analysis and Understanding, 2007, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007).

[10] Dov Dori, et al. A line drawings degradation model for performance characterization, 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings.

[11] Bénédicte Allier. Contribution à la numérisation des collections : apports des contours actifs, 2003.

[12] Kazuhiko Yamamoto, et al. Structured Document Image Analysis, 1992, Springer Berlin Heidelberg.

[13] R. F. Moghaddam, et al. Low quality document image modeling and enhancement, 2009, International Journal of Document Analysis and Recognition (IJDAR).