Morphological Tagging Approach in Document Analysis of Invoices

This paper describes a morphological tagging approach for the analysis of invoice document images. Tokens that are morphologically close and that recur at the same position in several similar contexts reveal parts of speech that represent the structural elements. This bottom-up approach avoids the use of a priori knowledge, provided that the text contains redundant and frequent contexts. The approach is applied to the invoice body text, roughly recognized by OCR and automatically segmented. The method makes it possible to detect the invoice articles and their different fields. The regular composition of an article and its repetition within the invoice greatly help in recovering its structure. On 276 invoices and 1704 articles, the recognition rate is above 91.02% for articles and 92.56% for fields.
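The idea of tagging tokens by their morphology and keeping only the patterns that recur can be illustrated with a minimal sketch. It is not the authors' implementation: the tag set (`INT`, `DEC`, `ALPHA`, `CODE`, `OTHER`), the function names, the `min_support` threshold and the sample lines are all assumptions made for illustration, and the input is assumed to be OCR text already segmented into lines.

```python
import re
from collections import Counter


def shape(token: str) -> str:
    """Map a token to a coarse morphological class (hypothetical tag set)."""
    if re.fullmatch(r"\d+", token):
        return "INT"      # e.g. a quantity
    if re.fullmatch(r"\d+[.,]\d+", token):
        return "DEC"      # e.g. a unit price or an amount
    if re.fullmatch(r"[A-Za-z]+", token):
        return "ALPHA"    # e.g. a word of the description
    if re.fullmatch(r"[A-Za-z0-9\-/]+", token):
        return "CODE"     # e.g. an article reference
    return "OTHER"


def line_pattern(line: str) -> tuple:
    """Tag every token of a line and collapse runs of identical tags."""
    collapsed = []
    for tag in (shape(tok) for tok in line.split()):
        if not collapsed or collapsed[-1] != tag:
            collapsed.append(tag)
    return tuple(collapsed)


def detect_articles(lines, min_support=2):
    """Keep the lines whose tag pattern recurs at least min_support times.

    Redundancy stands in for a priori knowledge: a pattern seen only once
    is treated as header or footer text, a repeated one as an article line.
    """
    patterns = [line_pattern(line) for line in lines]
    counts = Counter(patterns)
    return [
        line
        for line, pattern in zip(lines, patterns)
        if pattern and counts[pattern] >= min_support
    ]


# Hypothetical invoice body, one OCR line per entry.
sample = [
    "Invoice No 2004-118",
    "AB123 blue widget 2 14,50 29,00",
    "ZX9-7 steel bolt pack 10 0,35 3,50",
    "Total due 32,50",
]
articles = detect_articles(sample)
```

In this sketch the two middle lines share the pattern `CODE, ALPHA, INT, DEC` and are kept, while the header and total lines each have a unique pattern and are discarded. Collapsing runs of identical tags is a design choice that lets descriptions of different lengths map to the same pattern; a real system would also need to tolerate OCR errors, which this sketch does not attempt.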
