Document-Form Identification Using Constellation Matching of Keywords Abstracted by Character Recognition

A document-form identification method based on constellation matching of targets is proposed. Mathematical analysis shows that the method achieves a high identification rate when multiple targets are prepared. The method consists of two parts: (i) extraction of targets, such as important keywords in a document, by template matching between recognised characters and word strings in a keyword dictionary, and (ii) analysis of the positional or semantic relationship between the targets by point-pattern matching between these targets and the word-location information in the keyword dictionary. All characters in the document are recognised by a conventional character-recognition method. An automatic keyword-determination method, which is needed to build the keyword dictionary beforehand, is also proposed. This method selects the most suitable keywords from a general word dictionary by measuring the uniqueness of the keywords and the stability of their recognition. Experiments using 671 sample documents covering 107 different forms confirmed that (i) the keyword-determination method can determine sets of keywords automatically for 92.5% of the 107 forms and (ii) the form-identification method can correctly identify 97.1% of the 671 document samples at a rejection rate of 2.9%.
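The two-part procedure can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that keywords have already been extracted as exact string matches with page coordinates, that a form differs from its dictionary entry only by a translation, and that a fixed pixel tolerance `tol` is adequate. The names `match_score`, `identify_form`, `min_matches`, and the sample dictionary are hypothetical.

```python
def match_score(found, template, tol=10.0):
    """Count the template keywords that agree on a single translation.

    found:    list of (word, x, y) keywords extracted from the recognised text
    template: dict mapping word -> (x, y) location stored in the keyword dictionary
    """
    candidates = [(w, x, y) for (w, x, y) in found if w in template]
    best = 0
    for w0, x0, y0 in candidates:
        # Hypothesise the translation that maps the template onto this keyword.
        tx, ty = template[w0]
        dx, dy = x0 - tx, y0 - ty
        matched = set()
        for w, x, y in candidates:
            ex, ey = template[w]
            if abs(x - (ex + dx)) <= tol and abs(y - (ey + dy)) <= tol:
                matched.add(w)
        best = max(best, len(matched))
    return best


def identify_form(found, dictionary, min_matches=2, tol=10.0):
    """Return the best-matching form name, or None to reject the document."""
    if not dictionary:
        return None
    scores = {name: match_score(found, tpl, tol) for name, tpl in dictionary.items()}
    ranked = sorted(scores.values(), reverse=True)
    best_name = max(scores, key=scores.get)
    # Reject when too few keywords agree or when two forms tie for first place.
    if ranked[0] < min_matches or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return best_name


# Hypothetical keyword dictionary with two forms.
dictionary = {
    "invoice": {"Invoice": (100, 50), "Total": (400, 700), "Date": (450, 50)},
    "order": {"Order": (100, 50), "Total": (400, 300), "Date": (450, 50)},
}
# Keywords found in a scanned page, shifted by roughly (12, 8) pixels,
# plus one spurious "Total" that fits neither layout.
found = [("Invoice", 112, 58), ("Total", 413, 707), ("Date", 461, 59), ("Total", 50, 50)]
print(identify_form(found, dictionary))
```

In this sketch, requiring several keywords to agree on one placement is what lets a spurious or misrecognised keyword be ignored; the rejection branch corresponds to documents the method declines to identify.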
