DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using Bi-Directional Recurrent Neural Networks

Translating Natural Language Queries (NLQs) to Structured Query Language (SQL) in interfaces deployed over relational databases is a challenging task that has recently been widely studied in the database community. Conventional rule-based systems chain a series of solutions into a pipeline, one for each step of the task: stop word filtering, tokenization, stemming/lemmatization, parsing, tagging, and translation. Recent works have mostly focused on the translation step and handle the earlier steps with ad-hoc solutions. Within the pipeline, one of the most critical and challenging problems is keyword mapping: constructing a mapping between tokens in the query and relational database elements (tables, attributes, values, etc.). We define keyword mapping as a sequence tagging problem and propose a novel supervised deep-learning approach that utilizes the POS tags of NLQs. Our proposed approach, called DBTagger (DataBase Tagger), is an end-to-end and schema-independent solution, which makes it practical for various relational databases. We evaluate our approach on eight different datasets and report new state-of-the-art accuracy results, 92.4% on average. Our results also indicate that DBTagger is up to 10,000 times faster than its counterparts and scales to larger databases.
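To make the sequence tagging formulation concrete, the following minimal sketch shows one way the inputs and outputs of keyword mapping could be laid out: every token of the query carries a POS tag as an input feature and receives one schema tag as output. The query, the movie schema, the tag names (`TABLE:`, `ATTR:`, `VALUE:`, `O`), and the helper functions are all hypothetical illustrations and are not taken from DBTagger itself; no model is trained here.

```python
from typing import List, NamedTuple, Tuple


class TaggedToken(NamedTuple):
    token: str   # word from the natural language query
    pos: str     # part-of-speech tag, used as an input feature
    db_tag: str  # schema element the token maps to, or "O" for none


def align(tokens: List[str], pos_tags: List[str], db_tags: List[str]) -> List[TaggedToken]:
    """Zip the three parallel sequences; sequence tagging needs one tag per token."""
    if not (len(tokens) == len(pos_tags) == len(db_tags)):
        raise ValueError("tokens, POS tags, and schema tags must have equal length")
    return [TaggedToken(t, p, d) for t, p, d in zip(tokens, pos_tags, db_tags)]


def keyword_mappings(tagged: List[TaggedToken]) -> List[Tuple[int, str, str]]:
    """Return (position, token, schema tag) for every token mapped to a schema element."""
    return [(i, t.token, t.db_tag) for i, t in enumerate(tagged) if t.db_tag != "O"]


# Hypothetical query over a hypothetical movie database.
tokens = ["show", "movies", "directed", "by", "Nolan"]
pos_tags = ["VB", "NNS", "VBN", "IN", "NNP"]
db_tags = ["O", "TABLE:movie", "ATTR:movie.director", "O", "VALUE:movie.director"]

tagged = align(tokens, pos_tags, db_tags)
mappings = keyword_mappings(tagged)
```

In this layout a tagger's job is to predict the `db_tag` column from the `token` and `pos` columns; `keyword_mappings` then reads off the token-to-schema mapping that the later translation step would consume.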
