A Rule-Based Method for Table Detection in Website Images

Table detection is an essential part of document analysis because tables are among the most efficient ways of systematically summarizing information. Numerous studies have therefore addressed table detection, not only in documents but also on websites. Although the number of websites has grown explosively, most of these studies struggle to detect tables that are embedded as images rather than marked up with tags, owing to the variability of their size, content, color, and shape. In this paper, we propose an efficient yet robust method for detecting tables in image formats, which can be applied to both documents and websites. Instead of employing recently developed deep learning methods, which require extensive training to cover this diversity, we apply a rule-based detection method that exploits a key feature shared by many tables, namely, the grid arrangement of the text they contain. The proposed method consists of two stages: a feature extraction stage and a grid pattern recognition stage. In the first stage, we extract the features of the contents of the tables and then remove the features of non-text objects and of text not included in tables. In the second stage, we build tree structures from the features and apply a novel algorithm for determining the grid pattern. When we applied our method to a website dataset, the experimental results showed a precision, recall, and F1-measure of 84.5%, 72%, and 0.778, respectively, which are improvements of 3.6%, 24.16%, and 0.276 over a previous method, while also achieving the fastest processing time. In addition, the proposed rule-based method allows the structure of the contents of the table to be easily restored.
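
To make the idea of a grid arrangement of text concrete, the following is a minimal sketch and not the algorithm proposed in the paper, which builds tree structures from the extracted features. It assumes that text bounding boxes are already available as `(x, y, w, h)` tuples, for example from an OCR step. The function names `assign_bins` and `looks_like_grid` and the pixel tolerance `tol` are hypothetical choices made only for this illustration.

```python
def assign_bins(values, tol):
    """Map each coordinate to a cluster index.

    Sorted coordinates that differ from the previous one by at most
    `tol` pixels share a cluster (hypothetical alignment rule).
    """
    index = {}
    current = -1
    prev = None
    for v in sorted(set(values)):
        if prev is None or v - prev > tol:
            current += 1
        index[v] = current
        prev = v
    return index


def looks_like_grid(boxes, tol=5, min_rows=2, min_cols=2):
    """Return True if the text boxes fill a complete rows x columns grid.

    `boxes` is a list of (x, y, w, h) tuples; only the top-left corner
    is used here. This is an illustrative heuristic, not the paper's method.
    """
    if not boxes:
        return False
    col_of = assign_bins([x for x, _, _, _ in boxes], tol)
    row_of = assign_bins([y for _, y, _, _ in boxes], tol)
    # Each box occupies one (row, column) cell of the candidate grid.
    cells = {(row_of[y], col_of[x]) for x, y, _, _ in boxes}
    n_rows = max(r for r, _ in cells) + 1
    n_cols = max(c for _, c in cells) + 1
    # A grid needs enough rows and columns, with every cell occupied.
    return (n_rows >= min_rows and n_cols >= min_cols
            and len(cells) == n_rows * n_cols)


if __name__ == "__main__":
    table = [(10, 10, 40, 12), (100, 11, 40, 12),
             (10, 40, 40, 12), (101, 41, 40, 12)]
    paragraph = [(10, 10, 300, 12), (10, 30, 280, 12), (10, 50, 150, 12)]
    print(looks_like_grid(table))
    print(looks_like_grid(paragraph))
```

Under these assumptions, four boxes aligned in two rows and two columns pass the check, whereas left-aligned paragraph lines fail it because they form a single column. A practical detector would also need to tolerate empty or merged cells, which this sketch deliberately ignores.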
