Database-Agnostic Workload Management

We present a system to support generalized SQL workload analysis and management for multi-tenant and multi-database platforms. Workload analysis applications are becoming more sophisticated to support database administration, model user behavior, audit security, and route queries, but the methods rely on specialized feature engineering, and therefore must be carefully implemented and reimplemented for each SQL dialect, database system, and application. Meanwhile, the size and complexity of workloads are increasing as systems centralize in the cloud. We model workload analysis and management tasks as variations on query labeling, and propose a system design that can support general query labeling routines across multiple applications and database backends. The design relies on the use of learned vector embeddings for SQL queries as a replacement for application-specific syntactic features, reducing custom code and allowing the use of off-the-shelf machine learning algorithms for labeling. The key hypothesis, for which we provide evidence in this paper, is that these learned features can outperform conventional feature engineering on representative machine learning tasks. We present the design of a database-agnostic workload management and analytics service, describe potential applications, and show that separating workload representation from labeling tasks affords new capabilities and can outperform existing solutions for representative tasks, including workload sampling for index recommendation and user labeling for security audits.
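To make the separation of workload representation from labeling concrete, the following is a minimal sketch, not the system described here. It assumes a plain token-count vector as a stand-in for the learned embeddings, and a nearest-centroid classifier as one example of an off-the-shelf labeling algorithm. The names `tokenize`, `embed`, `cosine`, `NearestCentroidLabeler`, and the sample queries and user names are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(sql):
    """Split a SQL string into lowercase identifiers, numbers, and symbols."""
    return re.findall(r"[a-z_][a-z_0-9]*|\d+|[^\sa-z_0-9]", sql.lower())


def embed(sql):
    """Assumed stand-in for a learned embedding: a sparse token-count vector."""
    return dict(Counter(tokenize(sql)))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(w * b.get(tok, 0.0) for tok, w in a.items()) / (norm_a * norm_b)


class NearestCentroidLabeler:
    """Generic labeler: it sees only vectors and labels, never SQL text."""

    def fit(self, vectors, labels):
        sums = defaultdict(lambda: defaultdict(float))
        counts = Counter(labels)
        for vec, label in zip(vectors, labels):
            for tok, w in vec.items():
                sums[label][tok] += w
        # Each centroid is the mean vector of the queries sharing a label.
        self.centroids = {
            label: {tok: w / counts[label] for tok, w in total.items()}
            for label, total in sums.items()
        }
        return self

    def predict(self, vector):
        return max(self.centroids, key=lambda lb: cosine(self.centroids[lb], vector))


# Hypothetical labeled workload: (query text, user who issued it).
train = [
    ("SELECT name FROM customers WHERE id = 1", "alice"),
    ("SELECT name, city FROM customers WHERE id = 2", "alice"),
    ("INSERT INTO logs (msg) VALUES ('x')", "bob"),
    ("INSERT INTO logs (msg, level) VALUES ('y', 3)", "bob"),
]
labeler = NearestCentroidLabeler().fit(
    [embed(q) for q, _ in train], [user for _, user in train]
)
predicted_user = labeler.predict(embed("SELECT city FROM customers WHERE id = 7"))
```

Because the labeler depends only on the vectors, swapping `embed` for a learned model, or the user labels for cluster identifiers in a workload-sampling task, would leave the labeling code unchanged under these assumptions.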
