A Hybrid Probabilistic Approach for Table Understanding

Tables of data are used to record vast amounts of socioeconomic, scientific, and governmental information. Although humans create tables using underlying organizational principles, AI systems still struggle to understand their contents. This paper introduces an end-to-end system for table understanding, the process of capturing the relational structure of data in tables. We introduce models that identify cell types, group cells into blocks of data that serve a similar functional role, and predict the relationships between these blocks. Our approach is a hybrid, neuro-symbolic one, combining embedded representations learned from thousands of tables with probabilistic constraints that capture regularities in how humans organize tables. This neuro-symbolic model is better able to capture positional invariants of headers and enforce homogeneity of data types. One limitation in this research area is the lack of rich datasets for evaluating end-to-end table understanding, so we introduce a new benchmark dataset comprising 431 diverse tables from data.gov. The evaluation results show that our system achieves state-of-the-art performance on cell type classification, block identification, and relationship prediction, improving over prior efforts by up to 7% in macro F1 score.
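For readers unfamiliar with the reported metric, macro F1 is the unweighted mean of the per-class F1 scores, so rare cell types count as much as frequent ones. The following is a minimal sketch of that standard definition, not the evaluation code used in this work; the function name `macro_f1` and the cell-type labels in the usage note are hypothetical and chosen only for illustration.

```python
from collections import Counter


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over every label seen in gold or predicted."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Per-label counts of true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted label was wrongly assigned
            fn[g] += 1  # the gold label was missed

    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)

    return sum(scores) / len(scores) if scores else 0.0
```

As a usage example, calling `macro_f1(["header", "data", "data", "attribute"], ["header", "data", "attribute", "attribute"])` averages the F1 of three illustrative classes, each weighted equally regardless of how many cells carry that label.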
