Document degradation models and a methodology for degradation model validation

Printing, photocopying, and scanning degrade the image quality of a document. Although research in document understanding started in the sixties, only two document degradation models have been proposed thus far, and no attempts have been made to rigorously validate them. Yet in document understanding research, models of image degradation are crucial in many ways: they allow us to (i) conduct controlled experiments to study the breakdown points of document understanding systems, (ii) create large data sets with groundtruth for training classifiers, (iii) design optimal noise removal algorithms, and (iv) choose values for the free parameters of algorithms, among other uses.

In this thesis, two document degradation models are described. The first model accounts for the local pixel-level degradations that occur when a document is printed, photocopied, or scanned. The second model accounts for the perspective and illumination distortions that occur when a thick, bound document is photocopied or scanned. The local distortion model allows us to create large data sets of synthetically generated documents, in any language, together with the associated groundtruth, quite easily. Unlike isolated-character databases, our data sets are a much better representation of the real world because they naturally reflect real-world character and word occurrence probabilities as well as character and word bigram probabilities. Moreover, since our methodology puts the text, layout, formatting, resolution, and font details of the document image under the experimenter's control, a large variety of controlled experiments that were not possible earlier are now possible.

Next, an automatic document registration and character groundtruthing procedure is described. This procedure produces very accurate character groundtruth for scanned documents in any language, which had not been possible until now. The method registers the ideal image to a scanned version and then maps the groundtruth associated with the ideal image through the registration transformation. It can be used to generate groundtruth for documents in any language, and even for faxed documents. A data set of 33 scanned English document images with character groundtruth for 62,000 symbols was created using this procedure.

A non-parametric statistical procedure for estimating the parameters of the local degradation model from a sample of real degraded documents is then discussed. The estimation procedure allows researchers to generate large data sets from small samples of real data. No such parameter estimation procedures exist for other document degradation models; in fact, our approach can easily be adapted to estimate the parameters of other models as well.

Finally, a statistical methodology that can be used to validate the local degradation model is described. The method is based on a non-parametric, two-sample permutation test, and a variant of it permits approximate validation tests in place of the exact test. Another standard statistical device, the power function, is then used to choose between algorithm variables such as distance functions. Since the validation and power-function procedures are independent of the model, they can be used to validate any other degradation model. A method for comparing any two models is also described: it uses the p-values associated with the estimated models to select the model that is closer to the real world.
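The abstract does not spell out the functional form of the local pixel-level degradation model. As an illustration only, the sketch below (Python, using NumPy/SciPy) assumes a distance-dependent bit-flip model: each pixel is flipped with a probability that decays with its distance from the character boundary, followed by a morphological closing that coarsely models ink spread. The parameter names `alpha`, `beta`, `alpha0`, `beta0`, `eta`, and `k` are illustrative assumptions, not names or values taken from the thesis.

```python
import numpy as np
from scipy import ndimage

def degrade(image, alpha=2.0, beta=2.0, alpha0=1.0, beta0=1.0,
            eta=0.0, k=2, rng=None):
    """Apply a local, distance-dependent bit-flip degradation to a binary
    document image (1 = foreground ink, 0 = background), then apply a
    morphological closing.  The functional form and parameter names are
    illustrative assumptions, not taken from the abstract."""
    rng = np.random.default_rng() if rng is None else rng
    img = image.astype(bool)

    # Distance of each pixel to the opposite class (i.e. to the boundary).
    d_fg = ndimage.distance_transform_edt(img)    # ink pixels -> nearest background
    d_bg = ndimage.distance_transform_edt(~img)   # background pixels -> nearest ink

    # Flip probability decays with squared distance from the boundary.
    p_flip = np.where(img,
                      alpha0 * np.exp(-alpha * d_fg ** 2) + eta,
                      beta0 * np.exp(-beta * d_bg ** 2) + eta)
    p_flip = np.clip(p_flip, 0.0, 1.0)

    flipped = rng.random(img.shape) < p_flip
    noisy = np.where(flipped, ~img, img)

    # Closing coarsely models ink spread / optical blurring.
    closed = ndimage.binary_closing(noisy, structure=np.ones((k, k), dtype=bool))
    return closed.astype(np.uint8)
```

For instance, `degrade(binary_page, k=3)` returns a degraded page of the same shape as `binary_page`, so any groundtruth aligned with the ideal page still applies to the synthetic output.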

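The validation methodology is likewise described only at a high level. The following is a minimal, generic sketch of a non-parametric two-sample permutation test in Python; the feature extraction and the distance-based test statistic are left to the caller and are assumptions here, not the thesis's specific choices.

```python
import numpy as np

def permutation_test(sample_a, sample_b, statistic, n_perm=9999, rng=None):
    """Two-sample permutation test.

    sample_a, sample_b : arrays of features extracted from, e.g., real and
        synthetically degraded document images (feature choice is assumed).
    statistic : callable taking two samples and returning a scalar, where
        larger values indicate the samples look more different.
    Returns the observed statistic and its permutation p-value.
    """
    rng = np.random.default_rng() if rng is None else rng
    a, b = np.asarray(sample_a), np.asarray(sample_b)
    pooled = np.concatenate([a, b])
    n_a = len(a)

    observed = statistic(a, b)
    count = 0
    for _ in range(n_perm):
        # Randomly relabel the pooled observations and recompute the statistic.
        perm = rng.permutation(len(pooled))
        pa, pb = pooled[perm[:n_a]], pooled[perm[n_a:]]
        if statistic(pa, pb) >= observed:
            count += 1

    # Count the observed labelling itself so the p-value is never zero.
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value
```

For scalar features one could use, say, `statistic=lambda a, b: abs(a.mean() - b.mean())`. Sampling a fixed number of random permutations, as done here, rather than enumerating all of them is one standard way to obtain an approximate rather than exact permutation test.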