Tabular Functional Block Detection with Embedding-based Agglomerative Cell Clustering

Tables are a widely used format for data curation, but the diversity of their domains, layouts, and content makes knowledge extraction challenging. Understanding table layouts is an important step toward automatically harvesting knowledge from tabular data. Because table cells are spatially organized into regions, correctly identifying those regions and inferring their functional roles, a task referred to as functional block detection, is a critical part of understanding table layouts. Earlier functional block detection approaches fail to leverage spatial relationships and higher-level structure: they either depend on cell-level predictions or rely on data types as signals for identifying blocks. In this paper, we introduce a flexible functional block detection method based on agglomerative clustering, which merges smaller blocks into larger blocks using two merging strategies. Our proposed method uses cell embeddings with a customized dissimilarity function that uses local and margin distances, as well as block coherence metrics, to capture cell-, block-, and table-scoped features. Given the diversity of tables in real-world corpora, we also introduce a sampling-based approach that automatically tunes the distance thresholds for each table. Experimental results show that our method improves over the earlier state-of-the-art method on several evaluation metrics.
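The general idea of merging smaller blocks into larger ones under a distance threshold can be illustrated with a minimal sketch. This is not the method of the paper: it assumes one hypothetical embedding per table row, uses plain Euclidean distance between block centroids in place of the customized dissimilarity function, considers only vertically adjacent blocks, and takes a fixed `threshold` rather than a per-table tuned one. The function and parameter names (`merge_row_blocks`, `row_embeddings`, `threshold`) are illustrative assumptions.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def merge_row_blocks(row_embeddings, threshold):
    """Greedily merge vertically adjacent row blocks.

    Assumption: row_embeddings holds one vector per table row (hypothetical
    input). Each row starts as its own block; at every step the closest pair
    of adjacent blocks (by centroid distance) is merged, until the closest
    pair is farther apart than `threshold`.
    Returns a list of (start_row, end_row) pairs, both inclusive.
    """
    blocks = [(i, i) for i in range(len(row_embeddings))]
    while len(blocks) > 1:
        best_idx, best_dist = None, None
        for k in range(len(blocks) - 1):
            top = centroid(row_embeddings[blocks[k][0]:blocks[k][1] + 1])
            bottom = centroid(row_embeddings[blocks[k + 1][0]:blocks[k + 1][1] + 1])
            dist = euclidean(top, bottom)
            if best_dist is None or dist < best_dist:
                best_idx, best_dist = k, dist
        if best_dist > threshold:
            break  # no adjacent pair is similar enough to merge
        merged = (blocks[best_idx][0], blocks[best_idx + 1][1])
        blocks[best_idx:best_idx + 2] = [merged]
    return blocks


# Toy input: two header-like rows followed by two data-like rows.
rows = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
print(merge_row_blocks(rows, threshold=1.0))
```

On this toy input the first two rows and the last two rows end up in separate blocks, because the two resulting centroids are much farther apart than the threshold. Restricting merges to adjacent blocks is what keeps each block spatially contiguous, which is the property that distinguishes block detection from unconstrained clustering of cells.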
