Annotating Columns with Pre-trained Language Models

Inferring meta information about tables, such as column headers or relationships between columns, is an active research topic in data management, since many tables are missing some of this information. In this paper, we study the problem of annotating table columns (i.e., predicting column types and the relationships between columns) using only information from the table itself. We show that a multi-task learning approach (called Doduo), trained on both tasks using pre-trained language models, outperforms approaches that learn each task individually. Experimental results show that Doduo establishes new state-of-the-art performance on two benchmarks for the column type prediction and column relation prediction tasks, with improvements of up to 4.0% and 11.9%, respectively. We also show that Doduo can already match the previous state-of-the-art performance with a minimal number of tokens, only 8 tokens per column.

PVLDB Reference Format: Yoshihiko Suhara, Jinfeng Li, Yuliang Li, Dan Zhang, Çağatay Demiralp, Chen Chen, and Wang-Chiew Tan. Annotating Columns with Pre-trained Language Models. PVLDB, 14(1): XXX-XXX, 2020.
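To make the multi-task setup concrete, the sketch below shows one way a shared pre-trained encoder could feed two task-specific heads: one producing per-column type logits and one producing pairwise column relation logits. This is a minimal illustration assuming a BERT-style encoder from HuggingFace Transformers; the serialization scheme, marker-token pooling, head shapes, and all names (MultiTaskColumnAnnotator, col_positions, col_pairs) are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
from transformers import BertModel  # assumes HuggingFace Transformers is installed


class MultiTaskColumnAnnotator(nn.Module):
    """Shared pre-trained encoder with two task-specific heads:
    column type prediction and column relation prediction (hypothetical sketch)."""

    def __init__(self, num_types, num_relations, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.type_head = nn.Linear(hidden, num_types)         # per-column type logits
        self.rel_head = nn.Linear(hidden * 2, num_relations)  # pairwise relation logits

    def forward(self, input_ids, attention_mask, col_positions, col_pairs):
        # Encode the serialized table once; both tasks share this representation.
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = out.last_hidden_state  # (batch, seq_len, hidden)

        # Column embeddings: hidden state at each column's marker-token position.
        # col_positions: (batch, num_cols) token indices, one per column.
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device).unsqueeze(1)
        col_emb = hidden_states[batch_idx, col_positions]  # (batch, num_cols, hidden)

        # Task 1: column type prediction from each column embedding.
        type_logits = self.type_head(col_emb)

        # Task 2: relation prediction from concatenated embeddings of each column pair.
        # col_pairs: (batch, num_pairs, 2) indices into the column dimension.
        left = col_emb[batch_idx, col_pairs[..., 0]]
        right = col_emb[batch_idx, col_pairs[..., 1]]
        rel_logits = self.rel_head(torch.cat([left, right], dim=-1))

        return type_logits, rel_logits
```

A joint training loop could simply sum cross-entropy losses from the two heads so that gradients from both tasks update the shared encoder; the abstract's observation that 8 tokens per column already suffice suggests each column's values can be truncated aggressively before serialization.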
