Tablext: A Combined Neural Network And Heuristic Based Table Extractor

A significant portion of the data available today is found within tables. Therefore, it is necessary to use automated table extraction to obtain thorough results when data-mining. Today’s popular state-of-the-art methods for table extraction struggle to adequately extract tables with machine-readable text and structural data. To make matters worse, many tables, such as those saved as images, contain no machine-readable data at all, which makes most extraction methods completely ineffective. To address these issues, a novel, general-format table extraction tool, Tablext, is proposed. This tool uses a combination of computer vision techniques and machine learning methods to efficiently and effectively identify and extract data from tables. Tablext begins by using a custom Convolutional Neural Network (CNN) to identify and separate all potential tables. The identification process is optimized by combining the custom CNN with the YOLO object detection network. Then, the high-level structure of each table is identified with computer vision methods. This high-level structural metadata is used by another CNN to identify exact cell locations. As a final step, Optical Character Recognition (OCR) is performed on each individual cell to extract its content without needing machine-readable text. This multi-stage algorithm lets the neural networks focus on the complex tasks while image processing methods efficiently complete the simpler ones. As a result, the proposed approach is general-purpose enough to handle a large batch of tables regardless of their internal encodings or their layout complexity. It is also accurate enough to outperform competing state-of-the-art table extractors on the ICDAR 2013 table dataset.
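To make the role of the image-processing stage concrete, the following is a minimal sketch of one simple heuristic for recovering table structure: projection profiles that locate ruling lines in a binarized table image and turn them into cell bounding boxes. It is an illustration under stated assumptions, not Tablext's actual method. The abstract does not specify which computer vision techniques are used, and the names `ruling_positions`, `cell_boxes`, `make_ruled_grid`, and the `min_fill` threshold are all hypothetical. The sketch also assumes a fully ruled table with unbroken lines.

```python
from typing import List, Tuple

Grid = List[List[int]]  # assumed input: 1 = dark (ink) pixel, 0 = background


def ruling_positions(profile: List[int], length: int, min_fill: float = 0.9) -> List[int]:
    """Return the middle index of each run of rows/columns that is almost fully dark.

    `min_fill` is a hypothetical threshold: the fraction of the row or column
    that must be dark for it to count as part of a ruling line.
    """
    positions: List[int] = []
    run: List[int] = []
    for i, count in enumerate(profile):
        if count >= min_fill * length:
            run.append(i)
        elif run:
            positions.append(run[len(run) // 2])
            run = []
    if run:  # close a run that reaches the last row/column
        positions.append(run[len(run) // 2])
    return positions


def cell_boxes(img: Grid) -> List[Tuple[int, int, int, int]]:
    """Derive (top, left, bottom, right) boxes from horizontal and vertical rulings."""
    height, width = len(img), len(img[0])
    row_profile = [sum(row) for row in img]
    col_profile = [sum(img[y][x] for y in range(height)) for x in range(width)]
    ys = ruling_positions(row_profile, width)
    xs = ruling_positions(col_profile, height)
    # Each pair of consecutive rulings bounds one cell.
    return [(t, l, b, r) for t, b in zip(ys, ys[1:]) for l, r in zip(xs, xs[1:])]


def make_ruled_grid(height: int, width: int, rows: List[int], cols: List[int]) -> Grid:
    """Build a synthetic binary image with ruling lines at the given rows and columns."""
    return [[1 if (y in rows or x in cols) else 0 for x in range(width)]
            for y in range(height)]


if __name__ == "__main__":
    grid = make_ruled_grid(7, 9, rows=[0, 3, 6], cols=[0, 4, 8])
    for box in cell_boxes(grid):
        print(box)
```

A heuristic of this kind is cheap but brittle: it fails on tables without ruling lines or with merged cells, which is the sort of case where a learned model for exact cell localization, as described above, would be expected to help. Each resulting box could then be cropped and passed to an OCR engine.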
