Bridging the Semantic Gap with SQL Query Logs in Natural Language Interfaces to Databases

A critical challenge in constructing a natural language interface to a database (NLIDB) is bridging the semantic gap between a natural language query (NLQ) and the underlying data. This challenge manifests in two specific tasks: keyword mapping and join path inference. Keyword mapping is the task of mapping individual keywords in the original NLQ to database elements (such as relations, attributes, or values). It is challenging because the user's mental model and diction map ambiguously onto the schema definition and contents of the underlying database. Join path inference is the process of selecting the relations and join conditions in the FROM clause of the final SQL query. It is difficult because NLIDB users typically lack knowledge of the database schema or SQL and therefore cannot explicitly specify the intermediate tables and joins needed to construct the final SQL query. In this paper, we propose leveraging information from the SQL query log of a database to enhance the performance of existing NLIDBs with respect to these challenges. We present a system, Templar, that can be used to augment existing NLIDBs. Our extensive experimental evaluation demonstrates the effectiveness of our approach, improving the top-1 accuracy of existing NLIDBs by up to 138% through the use of SQL query log information.
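To make the join path inference task concrete, the following is a minimal illustrative sketch and not Templar's actual algorithm, which the abstract does not specify. It assumes a hypothetical schema graph (`SCHEMA_EDGES`), a hypothetical list of joins extracted from a query log (`LOG_JOINS`), and an assumed weighting of 1 / (1 + log frequency) per join edge. A shortest-path search over the weighted graph then prefers joins that past queries used often.

```python
import heapq
from collections import Counter

# Hypothetical schema graph: each pair is a joinable (foreign-key) relation pair.
SCHEMA_EDGES = [
    ("author", "writes"),
    ("writes", "publication"),
    ("publication", "journal"),
    ("author", "organization"),
    ("organization", "journal"),
]

# Hypothetical joins extracted from the FROM/WHERE clauses of logged SQL queries.
LOG_JOINS = [
    ("author", "writes"),
    ("writes", "publication"),
    ("publication", "journal"),
] * 5


def edge_key(a, b):
    """Order-independent key for an undirected join edge."""
    return tuple(sorted((a, b)))


def build_weights(schema_edges, log_joins):
    """Assumed weighting: joins seen more often in the log become cheaper."""
    counts = Counter(edge_key(a, b) for a, b in log_joins)
    return {
        edge_key(a, b): 1.0 / (1 + counts[edge_key(a, b)])
        for a, b in schema_edges
    }


def cheapest_join_path(weights, source, target):
    """Dijkstra's algorithm over the weighted schema graph."""
    adjacency = {}
    for (a, b), w in weights.items():
        adjacency.setdefault(a, []).append((b, w))
        adjacency.setdefault(b, []).append((a, w))

    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, w in adjacency.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + w, neighbor, path + [neighbor]))
    return None  # no join path connects the two relations


# Without a log, every join costs the same and the shortest path by hop count wins.
no_log_path = cheapest_join_path(build_weights(SCHEMA_EDGES, []), "author", "journal")

# With the log, the frequently used three-hop path becomes the cheapest.
log_path = cheapest_join_path(build_weights(SCHEMA_EDGES, LOG_JOINS), "author", "journal")
```

In this toy setting, the search without a log selects the two-hop path through `organization`, whereas the log-aware search selects the path through `writes` and `publication`, because each of those joins appears repeatedly in the hypothetical log. A connected tree spanning more than two relations would require a Steiner-tree-style search instead of a single shortest path.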
