Speech-to-SQL: Towards Speech-driven SQL Query Generation From Natural Language Question

Speech-based inputs have been gaining significant momentum with the popularity of smartphones and tablets in our daily lives, since voice is the easiest and most efficient way for humans to interact with computers. This paper works towards designing more effective speech-based interfaces for querying the structured data in relational databases. We first identify a new task named Speech-to-SQL, which aims to understand the information conveyed by human speech and directly translate it into structured query language (SQL) statements. A naive solution to this problem works in a cascaded manner, that is, an automatic speech recognition (ASR) component followed by a text-to-SQL component. However, it requires a high-quality ASR system and also suffers from error compounding between the two components, resulting in limited performance. To handle these challenges, we further propose a novel end-to-end neural architecture named SpeechSQLNet that directly translates human speech into SQL queries without an external ASR step. SpeechSQLNet has the advantage of making full use of the rich linguistic information present in speech. To the best of our knowledge, this is the first attempt to directly synthesize SQL from arbitrary natural language questions, rather than from a natural language-based version of SQL or its variants with a limited SQL grammar. To validate the effectiveness of the proposed problem and model, we further construct a dataset named SpeechQL by building on widely used text-to-SQL datasets. Extensive experimental evaluations on this dataset show that SpeechSQLNet can directly synthesize high-quality SQL queries from human speech, outperforming various competitive counterparts as well as the cascaded methods in terms of exact-match accuracy. We expect Speech-to-SQL to inspire more research on effective and efficient human-machine interfaces that lower the barrier to using relational databases.
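As a rough illustration of the two quantities mentioned above, the following Python sketch computes a simplified exact-match accuracy and a back-of-the-envelope estimate of how errors compound in a cascaded pipeline. It is not the paper's evaluation code. The function names, the string-level normalization, the example queries, and the assumption that the two stages fail independently (with any transcription error leading to a wrong query) are all illustrative choices made here.

```python
from typing import Sequence


def normalize(sql: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(sql.lower().split())


def exact_match_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted queries that equal the gold query after normalization.

    Assumption: a plain string comparison stands in for the metric; benchmark
    evaluators typically compare parsed SQL components instead.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    matches = sum(normalize(p) == normalize(g) for p, g in zip(predicted, gold))
    return matches / len(gold)


def cascaded_accuracy_estimate(asr_accuracy: float, text_to_sql_accuracy: float) -> float:
    """Rough accuracy of ASR followed by text-to-SQL.

    Assumption: the stages fail independently and every transcription error
    produces a wrong query, so the accuracies multiply.
    """
    return asr_accuracy * text_to_sql_accuracy


# Hypothetical example queries, for illustration only.
gold = [
    "SELECT name FROM singer WHERE age > 30",
    "SELECT count(*) FROM concert",
]
predicted = [
    "select name  from singer where age > 30",  # differs only in case and spacing
    "SELECT count(*) FROM stadium",             # wrong table
]

print(exact_match_accuracy(predicted, gold))  # 0.5
print(cascaded_accuracy_estimate(0.8, 0.7))
```

Under these assumptions, a cascade built from an 80%-accurate recognizer and a 70%-accurate text-to-SQL model is right on only about 56% of questions, which is the kind of loss an end-to-end model aims to avoid.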
