Automatic analysis of form images is a problem of both practical and theoretical interest: it is important for office automation, and it poses conceptual challenges for document image analysis. We describe an approach to extracting text, both typed and handwritten, from scanned and digitized images of filled-out forms. The method decomposes a filled-out form into three basic components, boxes, line segments, and the remainder (handwritten and typed characters, words, and logos), without using a priori knowledge of the form structure. The input binary image is first segmented into small and large connected components. Complex boxes are decomposed into elementary regions using an approach based on key-point analysis. Handwritten and machine-printed text that touches or overlaps guide lines and boxes is separated by removing the lines. Characters broken by line removal are rejoined using a character-patching method. Experimental results are given for filled-out forms from several domains (insurance, banking, tax, retail, and postal).
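To make the pipeline concrete, the following is a minimal Python sketch of two of the steps described above: splitting a binarized form into small and large connected components, and removing long guide lines so that overlapping text is isolated. It assumes OpenCV is available and that foreground pixels are white on a black background; the function names, kernel lengths, and the area threshold are illustrative assumptions, not the parameters or the exact method of the paper, and the key-point box decomposition and character-patching steps are not shown.

```python
# Sketch of connected-component splitting and guide-line removal for a
# binarized form image. Thresholds and kernel sizes are placeholders.
import cv2
import numpy as np

def split_components(binary, area_thresh=500):
    """Split connected components into small (likely characters) and
    large (likely boxes and guide lines) groups by pixel area."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    small = np.zeros_like(binary)
    large = np.zeros_like(binary)
    for i in range(1, n):  # label 0 is the background
        target = small if stats[i, cv2.CC_STAT_AREA] < area_thresh else large
        target[labels == i] = 255
    return small, large

def remove_form_lines(binary, min_line_len=40):
    """Approximate guide-line removal: detect long horizontal and vertical
    runs with morphological opening and subtract them from the image."""
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (min_line_len, 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, min_line_len))
    h_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
    v_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)
    lines = cv2.bitwise_or(h_lines, v_lines)
    text_only = cv2.bitwise_and(binary, cv2.bitwise_not(lines))
    return text_only, lines

if __name__ == "__main__":
    img = cv2.imread("form.png", cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    small, large = split_components(binary)
    text_touching_lines, lines = remove_form_lines(large)
```

Note that simple subtraction of detected lines breaks characters that cross them, which is why the paper follows line removal with a character-patching step to rejoin the broken strokes.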