The State of the Art of Document Image Degradation Modelling

The literature on models of document image degradation is reviewed, and open problems are listed. In response to the unpleasant fact that the accuracy of document recognition algorithms falls drastically when image quality degrades even slightly, researchers in the last decade have intensified their study of explicit, quantitative, parameterized models of image defects that occur during printing and scanning. Several models have been proposed, some motivated by the physics of image formation and others by the surface statistics of image distributions. A wide range of techniques for estimating parameters of these models has been explored. These models, in the form of pseudo-random generators of synthetic images, permit, for the first time, investigations into fundamental properties of concrete image recognition problems including the Bayes error of problems and the asymptotic accuracy and domain of competency of classifier technologies. The use of massive sets of synthetic images in the construction and testing of high-performance classifiers has accelerated in the last few years. Open problems include the search for methods for comparing competing models and sound methodologies for the use of synthetic data in engineering.
