Retrieving Complex Tables with Multi-Granular Graph Representation Learning

The task of natural language table retrieval (NLTR) seeks to retrieve semantically relevant tables based on natural language queries. Existing learning systems for this task often treat tables as plain text, on the assumption that tables are structured as dataframes. However, tables can have complex layouts that indicate diverse dependencies between subtable structures, such as nested headers. As a result, queries may refer to different spans of relevant content distributed across these structures. Moreover, such systems fail to generalize to novel scenarios beyond those seen in the training set. Prior methods therefore remain far from a generalizable solution to the NLTR problem, as they fall short in handling complex table layouts and queries over multiple granularities. To address these issues, we propose Graph-based Table Retrieval (GTR), a generalizable NLTR framework with multi-granular graph representation learning. In our framework, a table is first converted into a tabular graph, with cell nodes, row nodes, and column nodes that capture content at different granularities. The tabular graph is then fed into a Graph Transformer model that captures both table cell content and layout structures. To enhance the robustness and generalizability of the model, we further incorporate a self-supervised pre-training task based on graph-context matching. Experimental results on two benchmarks show that our method leads to significant improvements over the current state-of-the-art systems. Further experiments demonstrate promising cross-dataset generalization, as well as an enhanced capability for handling complex tables and fulfilling diverse query intents.
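
The conversion of a table into a tabular graph with cell, row, and column nodes can be illustrated with a minimal Python sketch. The abstract does not specify the edge scheme, so this example makes its own assumptions: every cell is linked to its row node and its column node, and the row and column nodes simply hold the concatenated text of their cells. The function name `build_tabular_graph` and the node identifier format are hypothetical and are not taken from the GTR implementation. Nested headers and merged cells are not modeled here.

```python
from typing import Dict, List, Tuple


def build_tabular_graph(
    table: List[List[str]],
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Turn a rectangular table into graph nodes and undirected edges.

    Assumption (not from the paper): each cell node is connected to the
    node of its row and the node of its column.
    """
    n_rows = len(table)
    n_cols = len(table[0]) if table else 0

    nodes: Dict[str, str] = {}
    edges: List[Tuple[str, str]] = []

    # Coarse-grained nodes: one per row and one per column.
    for r in range(n_rows):
        nodes[f"row:{r}"] = " ".join(table[r])
    for c in range(n_cols):
        nodes[f"col:{c}"] = " ".join(table[r][c] for r in range(n_rows))

    # Fine-grained nodes: one per cell, linked to its row and column.
    for r in range(n_rows):
        for c in range(n_cols):
            cell_id = f"cell:{r},{c}"
            nodes[cell_id] = table[r][c]
            edges.append((cell_id, f"row:{r}"))
            edges.append((cell_id, f"col:{c}"))

    return nodes, edges


if __name__ == "__main__":
    example = [
        ["Country", "Capital"],
        ["France", "Paris"],
        ["Japan", "Tokyo"],
    ]
    nodes, edges = build_tabular_graph(example)
    print(len(nodes), "nodes,", len(edges), "edges")
```

In this sketch a query about a single value can match a cell node, while a query about a whole record or attribute can match a row or column node, which is the intuition behind representing content at several granularities.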
