DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using Bi-Directional Recurrent Neural Networks

Translating Natural Language Queries (NLQs) to Structured Query Language (SQL) in interfaces deployed in relational databases is a challenging task, which has been widely studied in database community recently. Conventional rule based systems utilize series of solutions as a pipeline to deal with each step of this task, namely stop word filtering, tokenization, stemming/lemmatization, parsing, tagging, and translation. Recent works have mostly focused on the translation step overlooking the earlier steps by using adhoc solutions. In the pipeline, one of the most critical and challenging problems is keyword mapping; constructing a mapping between tokens in the query and relational database elements (tables, attributes, values, etc.). We define the keyword mapping problem as a sequence tagging problem, and propose a novel deep learning based supervised approach that utilizes POS tags of NLQs. Our proposed approach, called DBTagger (DataBase Tagger), is an end-to-end and schema independent solution, which makes it practical for various relational databases. We evaluate our approach on eight different datasets, and report new state-of-the-art accuracy results, 92.4% on the average. Our results also indicate that DBTagger is faster than its counterparts up to 10000 times and scalable for bigger databases. PVLDB Reference Format: Arif Usta, Akifhan Karakayali, and Özgür Ulusoy. DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using Bi-Directional Recurrent

[1]  Tao Yu,et al.  SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task , 2018, EMNLP.

[2]  Guillaume Lample,et al.  Neural Architectures for Named Entity Recognition , 2016, NAACL.

[3]  Abdul Quamar,et al.  State of the Art and Open Challenges in Natural Language Interfaces to Data , 2020, SIGMOD Conference.

[4]  Yoshua Bengio,et al.  Gated Feedback Recurrent Neural Networks , 2015, ICML.

[5]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[6]  Erik Cambria,et al.  Recent Trends in Deep Learning Based Natural Language Processing , 2017, IEEE Comput. Intell. Mag..

[7]  Umar Farooq Minhas,et al.  ATHENA: An Ontology-Driven System for Natural Language Querying over Relational Data Stores , 2016, Proc. VLDB Endow..

[8]  Matthew D. Zeiler ADADELTA: An Adaptive Learning Rate Method , 2012, ArXiv.

[9]  Richard Socher,et al.  Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning , 2018, ArXiv.

[10]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11]  Tao Yu,et al.  TypeSQL: Knowledge-Based Type-Aware Neural Text-to-SQL Generation , 2018, NAACL.

[12]  Graham Neubig,et al.  TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data , 2020, ACL.

[13]  Donald Kossmann,et al.  SODA: Generating SQL for Business Users , 2012, Proc. VLDB Endow..

[14]  Carsten Binnig,et al.  DBPal: A Learned NL-Interface for Databases , 2018, SIGMOD Conference.

[15]  Rich Caruana,et al.  Multitask Learning , 1998, Encyclopedia of Machine Learning and Data Mining.

[16]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Fei Li,et al.  Constructing an Interactive Natural Language Interface for Relational Databases , 2014, Proc. VLDB Endow..

[18]  H. V. Jagadish,et al.  Bridging the Semantic Gap with SQL Query Logs in Natural Language Interfaces to Databases , 2019, 2019 IEEE 35th International Conference on Data Engineering (ICDE).

[19]  Abraham Bernstein,et al.  A comparative survey of recent natural language interfaces for databases , 2019, The VLDB Journal.

[20]  Tao Yu,et al.  Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task , 2018, EMNLP.

[21]  Hyeonji Kim,et al.  Natural language to SQL: Where are we today? , 2020, Proc. VLDB Endow..

[22]  Po-Sen Huang,et al.  Natural Language to Structured Query Generation via Meta-Learning , 2018, NAACL.

[23]  Xifeng Yan,et al.  What It Takes to Achieve 100% Condition Accuracy on WikiSQL , 2018, EMNLP.

[24]  Alvin Cheung,et al.  Learning a Neural Semantic Parser from User Feedback , 2017, ACL.

[25]  Gary G. Hendrix,et al.  Developing a natural language interface to complex data , 1977, TODS.

[26]  Dawn Xiaodong Song,et al.  SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning , 2017, ArXiv.

[27]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[28]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[29]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[30]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Jonathan Berant,et al.  Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing , 2019, ACL.

[32]  Yunyao Li,et al.  Natural Language Data Management and Interfaces: Recent Development and Open Challenges , 2017, SIGMOD Conference.

[33]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[34]  Eduard H. Hovy,et al.  End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF , 2016, ACL.

[35]  Tomas Mikolov,et al.  Enriching Word Vectors with Subword Information , 2016, TACL.

[36]  Timothy Dozat,et al.  Incorporating Nesterov Momentum into Adam , 2016 .

[37]  Yan Gao,et al.  Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation , 2019, ACL.

[38]  Oren Etzioni,et al.  Towards a theory of natural language interfaces to databases , 2003, IUI '03.

[39]  Wei Xu,et al.  Bidirectional LSTM-CRF Models for Sequence Tagging , 2015, ArXiv.

[40]  Geoffrey Zweig,et al.  Spoken language understanding using long short-term memory neural networks , 2014, 2014 IEEE Spoken Language Technology Workshop (SLT).

[41]  Mihai Surdeanu,et al.  The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[42]  Carsten Binnig,et al.  DBPal: A Fully Pluggable NL2SQL Training Pipeline , 2020, SIGMOD Conference.

[43]  NAVID YAGHMAZADEH,et al.  SQLizer: query synthesis from natural language , 2017, Proc. ACM Program. Lang..