TABBIE: Pretrained Representations of Tabular Data

Existing work on tabular representation learning jointly models tables and associated text using self-supervised objective functions derived from pretrained language models such as BERT. While this joint pretraining improves performance on tasks involving paired tables and text (e.g., answering questions about tables), we show that it underperforms on tasks that operate over tables without any associated text (e.g., populating missing cells). We devise a simple pretraining objective (corrupt cell detection) that learns exclusively from tabular data and achieves state-of-the-art results on a suite of table-based prediction tasks. Unlike competing approaches, our model (TABBIE) provides embeddings of all table substructures (cells, rows, and columns), and it requires far less compute to train. A qualitative analysis of our model’s learned cell, column, and row representations shows that it captures complex table semantics and numerical trends.
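As a rough illustration of how training data for a corrupt cell detection objective could be built, the sketch below replaces a random subset of cells and records a binary label per cell. The function name `corrupt_cells`, the `replacement_pool` argument, the uniform sampling, and the 15% corruption rate are all assumptions made for this example; the abstract does not specify how cells are corrupted, and the model that predicts the labels is not shown.

```python
import random


def corrupt_cells(table, replacement_pool, corrupt_prob=0.15, seed=0):
    """Return (corrupted_table, labels) for a table given as a list of rows.

    labels[i][j] is 1 if cell (i, j) was replaced and 0 otherwise.
    Hypothetical helper: the sampling scheme here is an assumption,
    not the procedure used to train TABBIE.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for row in table:
        new_row, row_labels = [], []
        for cell in row:
            # Only consider replacements that differ from the original value,
            # so that a label of 1 always marks a cell that really changed.
            candidates = [v for v in replacement_pool if v != cell]
            if candidates and rng.random() < corrupt_prob:
                new_row.append(rng.choice(candidates))
                row_labels.append(1)
            else:
                new_row.append(cell)
                row_labels.append(0)
        corrupted.append(new_row)
        labels.append(row_labels)
    return corrupted, labels


if __name__ == "__main__":
    table = [
        ["Country", "Capital", "Population"],
        ["France", "Paris", "67M"],
        ["Japan", "Tokyo", "125M"],
    ]
    pool = ["Berlin", "1998", "Goalkeeper", "Canada"]  # cells from other tables
    noisy, y = corrupt_cells(table, pool, corrupt_prob=0.3, seed=1)
    for noisy_row, label_row in zip(noisy, y):
        print(noisy_row, label_row)
```

A model pretrained this way would take the corrupted table as input and predict the label for each cell, so the training signal comes from the table alone, with no paired text.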
