DuSQL: A Large-Scale and Pragmatic Chinese Text-to-SQL Dataset

Due to the lack of labeled data, previous research on text-to-SQL parsing mainly focuses on English. Representative English datasets include ATIS, WikiSQL, Spider, etc. This paper presents DuSQL, a larges-scale and pragmatic Chinese dataset for the cross-domain text-to-SQL task, containing 200 databases, 813 tables, and 23,797 question/SQL pairs. Our new dataset has three major characteristics. First, by manually analyzing questions from several representative applications, we try to figure out the true distribution of SQL queries in real-life needs. Second, DuSQL contains a considerable proportion of SQL queries involving row or column calculations, motivated by our analysis on the SQL query distributions. Finally, we adopt an effective data construction framework via human-computer collaboration. The basic idea is automatically generating SQL queries based on the SQL grammar and constrained by the given database. This paper describes in detail the construction process and data statistics of DuSQL. Moreover, we present and compare performance of several open-source text-to-SQL parsers with minor modification to accommodate Chinese, including a simple yet effective extension to IRNet for handling calculation SQL queries.

[1]  Fei Li,et al.  Constructing an Interactive Natural Language Interface for Relational Databases , 2014, Proc. VLDB Endow..

[2]  Yan Gao,et al.  Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation , 2019, ACL.

[3]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[4]  Ming Zhou,et al.  Semantic Parsing with Syntax- and Table-Aware SQL Generation , 2018, ACL.

[5]  Oren Etzioni,et al.  Towards a theory of natural language interfaces to databases , 2003, IUI '03.

[6]  Kristian J. Hammond,et al.  Question Answering from Frequently Asked Question Files: Experiences with the FAQ FINDER System , 1997, AI Mag..

[7]  Wang Ling,et al.  Latent Predictor Networks for Code Generation , 2016, ACL.

[8]  Dragomir R. Radev,et al.  Improving Text-to-SQL Evaluation Methodology , 2018, ACL.

[9]  Jonathan Berant,et al.  Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing , 2019, ACL.

[10]  Jonathan Berant,et al.  Don’t paraphrase, detect! Rapid and Effective Data Collection for Semantic Parsing , 2019, EMNLP.

[11]  Alvin Cheung,et al.  Learning a Neural Semantic Parser from User Feedback , 2017, ACL.

[12]  Jonathan Berant,et al.  Building a Semantic Parser Overnight , 2015, ACL.

[13]  Tao Yu,et al.  SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task , 2018, EMNLP.

[14]  Yue Zhang,et al.  A Pilot Study for Chinese SQL Semantic Parsing , 2019, EMNLP.

[15]  Dawn Xiaodong Song,et al.  SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning , 2017, ArXiv.

[16]  Tao Yu,et al.  Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task , 2018, EMNLP.

[17]  Kaushik Chakrabarti,et al.  X-SQL: reinforce schema representation with context , 2019, ArXiv.

[18]  Richard Socher,et al.  Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning , 2018, ArXiv.

[19]  Raymond J. Mooney,et al.  Using Multiple Clause Constructors in Inductive Logic Programming for Semantic Parsing , 2001, ECML.

[20]  M. Haggag,et al.  The Question Answering Systems : A Survey . , 2016 .

[21]  Oren Etzioni,et al.  Scaling question answering to the Web , 2001, WWW '01.

[22]  Yunfeng Liu,et al.  TableQA: a Large-Scale Chinese Text-to-SQL Dataset for Table-Aware SQL Generation , 2020, ArXiv.

[23]  Xiaodong Liu,et al.  RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers , 2019, ACL.

[24]  Jianfeng Gao,et al.  A Human Generated MAchine Reading COmprehension Dataset , 2018 .

[25]  Percy Liang,et al.  Lambda Dependency-Based Compositional Semantics , 2013, ArXiv.

[26]  Mirella Lapata,et al.  Language to Logical Form with Neural Attention , 2016, ACL.

[27]  Oren Etzioni,et al.  Scaling question answering to the Web , 2001, WWW '01.

[28]  NAVID YAGHMAZADEH,et al.  SQLizer: query synthesis from natural language , 2017, Proc. ACM Program. Lang..

[29]  Tong Guo,et al.  Content Enhanced BERT-based Text-to-SQL Generation , 2019, ArXiv.