Logical Labeling of Fixed Layout PDF Documents Using Multiple Contexts

Logical structure recovery is known to be of crucial importance, yet it remains unsolved not only for image-based documents but also for born-digital documents. In this work, the modeling of contextual information with 2D Conditional Random Fields (CRFs) is proposed to learn page structure for born-digital fixed-layout documents. Heuristic prior knowledge of Portable Document Format (PDF) content and layout is used to construct neighborhood graphs and various pairwise clique templates that model multiple contexts. By integrating local and contextual observations obtained from PDF attributes, ambiguities among semantic labels are better resolved. Experimental comparisons of six types of clique templates demonstrate the benefits of contextual information in the logical labeling of 16 finely defined categories.
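To make the idea of a neighborhood graph with pairwise clique templates concrete, the following is a minimal Python sketch and not the method of the paper. Everything in it is an assumption made for illustration: the `Block` type, the rule that links each block to its nearest neighbor below and to its right, the two template names (`"vertical"`, `"horizontal"`), the two labels (the paper uses 16 categories), the hand-picked weights, and the top-left coordinate origin. It computes an unnormalized CRF-style score (local scores plus pairwise scores) and finds the best labeling by brute force, which is only feasible for a handful of blocks.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Block:
    """Hypothetical text block: bounding box (y grows downward) and font size."""
    x0: float
    y0: float
    x1: float
    y1: float
    font_size: float


def overlaps(a0, a1, b0, b1):
    """True if the 1D intervals [a0, a1] and [b0, b1] share positive length."""
    return min(a1, b1) - max(a0, b0) > 0


def neighborhood_edges(blocks):
    """Link each block to its nearest block below and nearest block to the right.

    Each edge carries a template name, so that different spatial relations
    can use different pairwise weights.
    """
    edges = []
    for i, a in enumerate(blocks):
        below = [(b.y0 - a.y1, j) for j, b in enumerate(blocks)
                 if j != i and b.y0 >= a.y1 and overlaps(a.x0, a.x1, b.x0, b.x1)]
        right = [(b.x0 - a.x1, j) for j, b in enumerate(blocks)
                 if j != i and b.x0 >= a.x1 and overlaps(a.y0, a.y1, b.y0, b.y1)]
        if below:
            edges.append((i, min(below)[1], "vertical"))
        if right:
            edges.append((i, min(right)[1], "horizontal"))
    return edges


def unary(block, label):
    """Assumed local score: larger fonts favor 'heading', smaller favor 'body'."""
    margin = block.font_size - 11.0
    return margin if label == "heading" else -margin


# Assumed pairwise weights, keyed by (template, label of source, label of target).
PAIRWISE = {
    ("vertical", "heading", "body"): 0.5,
    ("vertical", "body", "body"): 0.25,
    ("vertical", "heading", "heading"): -0.5,
}


def score(labels, blocks, edges):
    """Unnormalized score of a labeling: sum of unary and pairwise terms."""
    total = sum(unary(b, lab) for b, lab in zip(blocks, labels))
    total += sum(PAIRWISE.get((t, labels[i], labels[j]), 0.0) for i, j, t in edges)
    return total


def best_labeling(blocks, edges, label_set):
    """Exhaustive search over all labelings (exponential; tiny pages only)."""
    return max(product(label_set, repeat=len(blocks)),
               key=lambda labels: score(labels, blocks, edges))


# One column of three blocks; the first has an ambiguous font size of 11.
blocks = [
    Block(50, 50, 300, 70, 11.0),
    Block(50, 80, 300, 140, 10.0),
    Block(50, 150, 300, 210, 10.0),
]
edges = neighborhood_edges(blocks)
best = best_labeling(blocks, edges, ("heading", "body"))
print(edges)
print(best)
```

In this toy page the local score of the first block is zero for both labels, so only the pairwise term over the vertical edge can decide its label. A trained model would learn the weights from labeled pages and use an inference method that scales beyond exhaustive search.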
