TableQA: a Large-Scale Chinese Text-to-SQL Dataset for Table-Aware SQL Generation

Parsing natural language into the corresponding SQL (NL2SQL) with data-driven approaches such as deep neural networks has attracted much attention in recent years. Existing NL2SQL datasets assume that condition values appear exactly in the natural language question and that every query is answerable given the table. However, these assumptions may fail in practical scenarios, because users may use different expressions for the same content in the table, and may ask for information outside the table without having the full picture of its contents. We therefore present TableQA, a large-scale cross-domain natural-language-to-SQL dataset in Chinese consisting of 64,891 questions and 20,311 unique SQL queries on over 6,000 tables. Unlike existing NL2SQL datasets, TableQA requires models to generalize not only to the SQL skeletons of different questions and table schemas, but also to the various expressions used for condition values. Experimental results show that the state-of-the-art model with 95.1% condition value accuracy on WikiSQL reaches only 46.8% condition value accuracy and 43.0% logic form accuracy on TableQA, indicating that the proposed dataset is challenging and necessary to address. Two table-aware approaches are proposed to alleviate the problem; the end-to-end approach obtains 51.3% and 47.4% accuracy on the condition value and logic form tasks, an improvement of 4.7% and 3.4% respectively.
