DocParser: Hierarchical Structure Parsing of Document Renderings

Translating document renderings (e.g., PDFs, scans) into hierarchical structures is in high demand across many real-world applications and is often a prerequisite for downstream NLP tasks. Earlier work focused on different but simpler tasks, such as detecting table or cell locations within documents; a holistic, principled approach to inferring the complete hierarchical structure of documents has been missing. As a remedy, we developed "DocParser": an end-to-end system for parsing the complete document structure, including all text elements, figures, tables, and table cell structures. To the best of our knowledge, DocParser is the first system that derives the full hierarchical composition of documents. Given the complexity of the task, annotating appropriate datasets is costly. Therefore, our second contribution is a dataset for evaluating hierarchical document structure parsing. Our third contribution is a scalable learning framework for settings where domain-specific data is scarce, which we address with a novel approach to weak supervision. Our computational experiments confirm the effectiveness of the proposed weak supervision: compared to the baseline without weak supervision, it improves the mean average precision for detecting document entities by 37.1%. When classifying hierarchical relations between entity pairs, it improves the F1 score by 27.6%.
