Document representation refinement for precise region description

Precise description of layout entities (content regions on a page) is crucial for all but the most trivial document analysis and recognition applications. The output of layout analysis methods and state-of-the-art OCR systems varies significantly, from bounding boxes (e.g. Tesseract) to stacks of text line rectangles (e.g. ABBYY FineReader). There is a clear need for a consistent and accurate representation of regions (e.g. text paragraphs, graphics entities etc.) for further processing, correction and performance evaluation (comparison of segmentation results with ground truth regions). This paper describes a method for refinement of document representations by fitting polygons around lower-level layout objects (such as text lines, words and glyphs) in a systematic way that reconstructs region outlines and preserves the fine details of complex layouts. Experimental results on a standard dataset demonstrate the validity and usefulness of the proposed approach.

[1]  Apostolos Antonacopoulos,et al.  The PAGE (Page Analysis and Ground-Truth Elements) Format Framework , 2010, 2010 20th International Conference on Pattern Recognition.

[2]  Apostolos Antonacopoulos,et al.  ICDAR 2009 Page Segmentation Competition , 2003, 2009 10th International Conference on Document Analysis and Recognition.

[3]  Tim Ritchings,et al.  Representation and classification of complex-shaped printed regions using white tiles , 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[4]  Apostolos Antonacopoulos,et al.  Aletheia - An Advanced Document Layout and Text Ground-Truthing System for Production Environments , 2011, 2011 International Conference on Document Analysis and Recognition.

[5]  Thomas M. Breuel The hOCR Microformat for OCR Workflow and Results , 2007, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007).

[6]  Apostolos Antonacopoulos,et al.  A Realistic Dataset for Performance Evaluation of Document Layout Analysis , 2009, 2009 10th International Conference on Document Analysis and Recognition.

[7]  Apostolos Antonacopoulos,et al.  ICDAR 2013 Competition on Historical Newspaper Layout Analysis (HNLA 2013) , 2013, 2013 12th International Conference on Document Analysis and Recognition.

[8]  Basilios Gatos,et al.  ICDAR2005 page segmentation competition , 2007, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).

[9]  Friedrich M. Wahl,et al.  Block segmentation and text extraction in mixed text/image documents , 1982, Comput. Graph. Image Process..

[10]  Azriel Rosenfeld,et al.  Computer Vision , 1988, Adv. Comput..

[11]  Apostolos Antonacopoulos,et al.  Scenario Driven In-depth Performance Evaluation of Document Layout Analysis Methods , 2011, 2011 International Conference on Document Analysis and Recognition.

[12]  Linda G. Shapiro,et al.  Computer Vision , 2001 .

[13]  Apostolos Antonacopoulos,et al.  The IMPACT dataset of historical document images , 2013, HIP '13.

[14]  David Vernon,et al.  Machine vision - automated visual inspection and robot vision , 1991 .