Semantic information extraction from images of complex documents

Although digital document processing is increasingly widespread in industry, printed documents remain in common use. To process the contents of printed documents electronically, information must be extracted from their digital images. In complex documents, regions and fields can be highly heterogeneous in layout, print quality, fonts, and typing standards, which makes reconstructing the contents from digital images a difficult problem. In this article we present an efficient solution to this problem, in which the semantic contents of the fields of a complex document are extracted from a digital image.
