LearnedSQLGen: Constraint-aware SQL Generation using Reinforcement Learning

Many database optimization problems, such as slow SQL diagnosis, database testing, and optimizer tuning, require a large volume of SQL queries. Because privacy concerns make real SQL queries hard to obtain, SQL generation is an important task in database optimization. Existing SQL generation methods either generate queries randomly or rely on human-crafted SQL templates, and they cannot meet user-specific requirements such as slow SQL queries or SQL queries with large result sizes. To address this problem, this paper studies constraint-aware SQL generation: given a constraint (e.g., cardinality within [1k, 2k]), generate SQL queries that satisfy it. The problem is challenging because the relationship between a query constraint (e.g., on cardinality or cost) and the SQL queries that satisfy it is hard to capture, so it is hard to guide a generation method toward meeting the constraint. To address this challenge, we propose LearnedSQLGen, a reinforcement learning (RL) based framework for generating queries that satisfy the constraint. LearnedSQLGen adopts an exploration-exploitation strategy that exploits the generation direction following the query constraint, which is learned from query execution feedback. We judiciously design the RL reward function to guide the generation process accurately, and we integrate a finite-state machine to ensure that the generated SQL queries are valid. Experimental results on three benchmarks show that LearnedSQLGen significantly outperforms the baselines in both accuracy (30% better) and efficiency (10-35 times).
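The two mechanisms named in the abstract, a finite-state machine that keeps generated queries valid and a reward derived from execution feedback, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy grammar `FSM`, the single table `t(x)`, the helper names, and the shape of the `reward` function are hypothetical, and uniform random sampling stands in for the learned RL policy.

```python
import random
import sqlite3

# Hypothetical toy grammar (not the paper's): each state maps an allowed
# next token to the state that token leads to. Tokens outside this map
# can never be emitted, so every generated query is syntactically valid.
FSM = {
    "start":  {"SELECT": "select"},
    "select": {"*": "cols"},
    "cols":   {"FROM": "from"},
    "from":   {"t": "table"},
    "table":  {"WHERE": "where", "<eos>": "end"},
    "where":  {"x": "col"},
    "col":    {"<": "op", ">": "op"},
    "op":     {str(v): "value" for v in (10, 50, 90)},
    "value":  {"<eos>": "end"},
}


def generate_query(rng):
    """Walk the FSM, sampling uniformly among the tokens it allows.

    A learned policy would replace rng.choice with a distribution over
    the same masked action set; that part is not sketched here.
    """
    state, tokens = "start", []
    while state != "end":
        allowed = FSM[state]
        token = rng.choice(sorted(allowed))
        state = allowed[token]
        if token != "<eos>":
            tokens.append(token)
    return " ".join(tokens)


def make_db():
    """Build an in-memory table t with x = 0..99 (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])
    return conn


def cardinality(conn, query):
    """Execution feedback: number of rows the query returns."""
    return conn.execute(f"SELECT COUNT(*) FROM ({query})").fetchone()[0]


def reward(card, low, high):
    """Assumed reward shape: 1.0 inside [low, high], decaying toward 0
    as the cardinality moves away from the range. Requires low > 0."""
    if low <= card <= high:
        return 1.0
    if card < low:
        return card / low
    return high / card


if __name__ == "__main__":
    conn, rng = make_db(), random.Random(0)
    for _ in range(5):
        q = generate_query(rng)
        c = cardinality(conn, q)
        print(f"{q!r}: cardinality={c}, reward={reward(c, 40, 60):.2f}")
```

In this sketch the FSM plays the role of an action mask: the generator only ever chooses among tokens that are legal in the current state, so no reward signal is wasted on invalid SQL. The reward then scores each complete query by how close its observed cardinality is to the target range.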
