On Graph-Based Verification for PDF Table Detection

Many non-editable documents are shared in PDF (Portable Document Format). Such documents typically lack tags that annotate the page layout, including table positions. Table detection is therefore one of the important challenges in analyzing and understanding them. This paper outlines a novel two-phase approach to table detection in untagged PDF documents. The first phase uses deep neural networks (DNNs) to predict table candidates. The second phase selects probable tables from the candidates by verifying their graph representations. We build a weighted directed graph from the text blocks inside the predicted area of a table. A set of such graphs produced from the “ICDAR 2013 Table Competition” dataset allowed us to train a verification model based on the Random Forest technique. The empirical results on the competition dataset demonstrate the high performance of our implementation of this approach. We show that the additional verification reduces errors and improves the results of PDF table detection.
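To make the second phase more concrete, the following minimal sketch shows one way a weighted directed graph could be built from the text blocks of a candidate area and reduced to a feature vector for a classifier. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the `TextBlock` structure, the edge rule (each block is linked to its nearest right and nearest lower neighbour), the gap-based weights, and the three features are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class TextBlock:
    """Bounding box of a text block (hypothetical structure, y grows downwards)."""
    x0: float
    y0: float
    x1: float
    y1: float


def _overlap(a0: float, a1: float, b0: float, b1: float) -> bool:
    """Return True if the 1-D intervals (a0, a1) and (b0, b1) overlap."""
    return min(a1, b1) > max(a0, b0)


def build_graph(blocks: List[TextBlock]) -> Dict[Tuple[int, int], float]:
    """Link each block to its nearest right and nearest lower neighbour.

    Returns a mapping (source index, target index) -> weight, where the
    weight is the gap between the two boxes. The edge rule is an assumption
    made for this sketch.
    """
    edges: Dict[Tuple[int, int], float] = {}
    for i, a in enumerate(blocks):
        right: Optional[Tuple[float, int]] = None  # (gap, index)
        below: Optional[Tuple[float, int]] = None  # (gap, index)
        for j, b in enumerate(blocks):
            if i == j:
                continue
            # Right neighbour: overlaps vertically and starts after `a` ends.
            if _overlap(a.y0, a.y1, b.y0, b.y1) and b.x0 >= a.x1:
                gap = b.x0 - a.x1
                if right is None or gap < right[0]:
                    right = (gap, j)
            # Lower neighbour: overlaps horizontally and starts below `a`.
            if _overlap(a.x0, a.x1, b.x0, b.x1) and b.y0 >= a.y1:
                gap = b.y0 - a.y1
                if below is None or gap < below[0]:
                    below = (gap, j)
        for nearest in (right, below):
            if nearest is not None:
                edges[(i, nearest[1])] = nearest[0]
    return edges


def graph_features(n_nodes: int, edges: Dict[Tuple[int, int], float]) -> List[float]:
    """Hypothetical features: node count, edge count, mean edge weight."""
    weights = list(edges.values())
    mean_weight = sum(weights) / len(weights) if weights else 0.0
    return [float(n_nodes), float(len(edges)), mean_weight]


# A 2 x 2 grid of cells, as might be found inside a small table candidate.
blocks = [
    TextBlock(0, 0, 10, 5), TextBlock(20, 0, 30, 5),
    TextBlock(0, 10, 10, 15), TextBlock(20, 10, 30, 15),
]
edges = build_graph(blocks)
print(graph_features(len(blocks), edges))  # [4.0, 4.0, 7.5]
```

Feature vectors of this kind, computed for labelled candidates, could then be passed to a Random Forest classifier that accepts or rejects each candidate. Regular grids tend to yield uniform edge weights, which is one plausible signal separating true tables from false positives.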
