Business form classification using strings

Business forms are "linear" documents that can be accurately described by a one-dimensional data structure. This paper proposes a novel approach to form identification using strings, which can also serve as a basis for handling other "linear" documents such as logos or line drawings. A set of known blank forms is stored in a database, and each incoming form is automatically matched to one of them; forms that are not in the database are also detected. Matching relies on a simple, distinctive "signature" for each document: a string that describes the elements present on the form, namely the location and size of lines, corners and blocks of text, quantised as discrete symbols. A specially adapted, efficient string edit distance calculation is then applied for matching, and unregistered forms are detected by examining the elements left unmatched between two strings. The string format extends the conventional one-dimensional representation of strings to a richer "one-and-a-half dimensional" structure, and the method requires no training.
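The matching idea can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the `Element` type, the 8-level quantisation, the symbol layout `(kind, x bin, y bin, size bin)`, the cost function, and the rejection threshold are all hypothetical. The distance is the textbook Wagner-Fischer edit distance with a graded substitution cost, not the paper's specially adapted calculation, and unregistered forms are rejected with a simple normalised-distance threshold rather than by inspecting unmatched elements.

```python
from dataclasses import dataclass

LEVELS = 8  # assumed number of quantisation bins per attribute


@dataclass(frozen=True)
class Element:
    kind: str    # "L" = line, "C" = corner, "T" = text block (assumed codes)
    x: float     # position and size, normalised to [0, 1]
    y: float
    size: float


def quantise(value, levels=LEVELS):
    """Map a value in [0, 1] to an integer bin in 0..levels-1."""
    return min(int(value * levels), levels - 1)


def signature(elements):
    """Turn form elements into an ordered list of discrete symbols."""
    symbols = [(e.kind, quantise(e.x), quantise(e.y), quantise(e.size))
               for e in elements]
    # Read the form top-to-bottom, then left-to-right.
    return sorted(symbols, key=lambda s: (s[2], s[1]))


def sub_cost(a, b):
    """Substitution cost: 1 for different kinds, else graded by bin distance."""
    if a[0] != b[0]:
        return 1.0
    diff = sum(abs(p - q) for p, q in zip(a[1:], b[1:]))
    return min(1.0, diff / LEVELS)


def edit_distance(a, b):
    """Wagner-Fischer dynamic programme; insertions and deletions cost 1."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, sym_a in enumerate(a, start=1):
        curr = [float(i)]
        for j, sym_b in enumerate(b, start=1):
            curr.append(min(prev[j] + 1.0,                        # delete sym_a
                            curr[j - 1] + 1.0,                    # insert sym_b
                            prev[j - 1] + sub_cost(sym_a, sym_b)))  # substitute
        prev = curr
    return prev[-1]


def classify(sig, database, max_normalised=0.3):
    """Return the best-matching form name, or None if nothing is close enough."""
    best_name, best_score = None, float("inf")
    for name, stored in database.items():
        score = edit_distance(sig, stored) / max(len(sig), len(stored), 1)
        if score < best_score:
            best_name, best_score = name, score
    return best_name if best_score <= max_normalised else None


# Two registered blank forms (made-up layouts).
database = {
    "invoice": signature([Element("L", 0.1, 0.1, 0.8),
                          Element("T", 0.1, 0.2, 0.3),
                          Element("C", 0.9, 0.9, 0.05)]),
    "order": signature([Element("T", 0.5, 0.1, 0.5),
                        Element("L", 0.1, 0.5, 0.8)]),
}

# A noisy scan of the invoice, and a form that is not registered.
scanned = signature([Element("L", 0.12, 0.11, 0.79),
                     Element("T", 0.13, 0.22, 0.4),
                     Element("C", 0.9, 0.9, 0.05)])
unknown = signature([Element("T", 0.5, 0.5, 0.5),
                     Element("T", 0.5, 0.6, 0.5)])

print(classify(scanned, database))
print(classify(unknown, database))
```

Quantising positions and sizes before comparison lets small scanning offsets map to the same or a neighbouring symbol, and the graded substitution cost keeps such near-misses cheap compared with a missing or extra element.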
