Automatic document classification and indexing in high-volume applications

Abstract. This paper describes a system for the analysis and automatic indexing of imaged documents in high-volume applications. The system, named STRETCH (STorage and RETrieval by Content of imaged documents), is built around an Archiving and Retrieval Engine that overcomes the document-profiling bottleneck by bypassing some limitations of existing predefined indexing schemes. The engine exploits a structured document representation and can activate appropriate methods to characterise and automatically index heterogeneous documents with variable layout. The originality of STRETCH lies principally in letting unskilled users define the indexes relevant to their document domains of interest simply by presenting visual examples; reliable automatic information extraction methods (document classification, flexible reading strategies) are then applied to index the documents automatically, thus creating archives as desired. STRETCH offers ease of use and of application programming, and can adapt dynamically to new types of documents. The system has been tested in two applications in particular, one concerning passive invoices and the other bank documents, each involving several classes of documents. The indexing strategy first classifies the document automatically, thus avoiding pre-sorting, and then locates and reads the information pertaining to the specific document class. Experimental results are encouraging overall; in particular, document classification results fulfill the requirements of high-volume applications. Integration into production lines is under way.
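The two-step indexing strategy in the abstract (classify first, then locate and read class-specific information) can be illustrated with a minimal sketch. It is not STRETCH's implementation: the abstract says the engine works on a structured representation of the document image, whereas this toy version assumes plain OCR text, a keyword-overlap classifier, and regular expressions for field reading. The names `DocumentClass`, `classify`, `index_document`, and the two example classes are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class DocumentClass:
    """Hypothetical class model: cue words for classification, regexes for index fields."""
    name: str
    cue_words: set
    field_patterns: dict


def classify(text, classes):
    """Toy stand-in for document classification: pick the class whose
    cue words overlap most with the words found in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return max(classes, key=lambda c: len(c.cue_words & words))


def index_document(text, classes):
    """Step 1: classify the document. Step 2: read only the fields
    defined for the class that was chosen."""
    doc_class = classify(text, classes)
    index = {"class": doc_class.name}
    for field, pattern in doc_class.field_patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        index[field] = match.group(1) if match else None
    return index


# Assumed example classes, loosely mirroring the two test applications.
CLASSES = [
    DocumentClass(
        "invoice",
        {"invoice", "vat", "total"},
        {"number": r"invoice\s+no\.?\s*(\w+)", "total": r"total\s*:?\s*([\d.,]+)"},
    ),
    DocumentClass(
        "bank_statement",
        {"account", "balance", "statement"},
        {"account": r"account\s*:?\s*(\w+)", "balance": r"balance\s*:?\s*([\d.,]+)"},
    ),
]

sample = "INVOICE No. A123\nVAT included\nTotal: 1,250.00"
result = index_document(sample, CLASSES)
print(result)
```

Because the field patterns are attached to the class model, supporting a new document type in this sketch means adding one more `DocumentClass` entry; no pre-sorting of the input is required.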
