NeuroDB: A Neural Network Framework for Answering Range Aggregate Queries and Beyond

Range aggregate queries (RAQs) are integral to many real-world applications, which often require fast, approximate answers. Recent work answers RAQs with machine learning models by learning a model of the data and using it to answer queries. However, such modelling choices fail to utilize any query-specific information. To capture this information, we observe that RAQs can be represented by query functions: functions that take a query instance (i.e., a specific RAQ) as input and output its answer. Using this representation, we formulate the problem of learning to approximate the query function and propose NeuroDB, a query-specialized neural network framework that answers RAQs efficiently. We experimentally show that NeuroDB answers RAQs orders of magnitude faster than the state of the art on real-world, benchmark, and synthetic datasets. Furthermore, NeuroDB is query-type agnostic (i.e., it makes no assumption about the underlying query type), and our observation that queries can be represented by functions is not specific to RAQs. We therefore investigate whether NeuroDB can be used for other query types by applying it to distance-to-nearest-neighbour queries. We experimentally show that NeuroDB outperforms the state of the art for this query type, often by orders of magnitude. Moreover, it does so with the same neural network architecture used for RAQs, which points to the possibility of a generic framework that answers any query type efficiently.
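The query-function view described above can be illustrated with a minimal sketch. It is not the NeuroDB implementation: the 1-D range COUNT query, the tiny one-hidden-layer network, the training loop, and every name (`make_query_function`, `TinyMLP`, `random_query`) are assumptions chosen only to show the idea of sampling queries, computing their exact answers, and fitting a model that maps a query directly to its answer.

```python
import math
import random

def make_query_function(data):
    """Exact query function f(q) for a 1-D range COUNT, as a fraction of the data."""
    def f(q):
        lo, hi = q
        return sum(1 for x in data if lo <= x <= hi) / len(data)
    return f

class TinyMLP:
    """One tanh hidden layer; an illustrative stand-in, not the paper's architecture."""
    def __init__(self, n_in, n_hidden, rng):
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [0.0] * n_hidden  # zero output weights: initial prediction is 0
        self.b2 = 0.0

    def _hidden(self, q):
        return [math.tanh(sum(w * x for w, x in zip(row, q)) + b)
                for row, b in zip(self.w1, self.b1)]

    def predict(self, q):
        h = self._hidden(q)
        return sum(w * v for w, v in zip(self.w2, h)) + self.b2

    def sgd_step(self, q, y, lr):
        """One stochastic gradient step on the squared error 0.5 * (pred - y)^2."""
        h = self._hidden(q)
        err = sum(w * v for w, v in zip(self.w2, h)) + self.b2 - y
        for j, hj in enumerate(h):
            # Backpropagate through tanh using the output weight before its update.
            grad_pre = err * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * err * hj
            self.b1[j] -= lr * grad_pre
            for i, xi in enumerate(q):
                self.w1[j][i] -= lr * grad_pre * xi
        self.b2 -= lr * err

def mse(model, queries, answers):
    return sum((model.predict(q) - y) ** 2
               for q, y in zip(queries, answers)) / len(queries)

def random_query(rng):
    """Sample a range query (lo, hi) with lo <= hi inside [0, 1)."""
    lo, hi = sorted((rng.random(), rng.random()))
    return (lo, hi)

rng = random.Random(0)
data = [rng.random() for _ in range(1000)]   # synthetic 1-D dataset (assumption)
f = make_query_function(data)

# Training set: sampled query instances paired with their exact answers.
train_q = [random_query(rng) for _ in range(200)]
train_y = [f(q) for q in train_q]

model = TinyMLP(n_in=2, n_hidden=8, rng=rng)
loss_before = mse(model, train_q, train_y)
for _ in range(100):
    for q, y in zip(train_q, train_y):
        model.sgd_step(q, y, lr=0.05)
loss_after = mse(model, train_q, train_y)

print(f"training MSE before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Once trained, such a model answers a new query with a single forward pass, `model.predict((lo, hi))`, without scanning the data. Only the function that produces the training answers is specific to the query type, which is the sense in which the approach can be query-type agnostic.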


[29]  Magdalena Balazinska,et al.  Learning State Representations for Query Optimization with Deep Reinforcement Learning , 2018, DEEM@SIGMOD.

[30]  Tim Kraska,et al.  Lifting the Curse of Multidimensional Data with Learned Existence Indexes , 2018 .

[31]  Shai Ben-David,et al.  Understanding Machine Learning: From Theory to Algorithms , 2014 .

[32]  Carsten Binnig,et al.  DeepDB , 2019, Proc. VLDB Endow..

[33]  Yi Zhang,et al.  Incorporating Diversity and Density in Active Learning for Relevance Feedback , 2007, ECIR.

[34]  Michael I. Jordan,et al.  Transferable Adversarial Training: A General Approach to Adapting Deep Classifiers , 2019, ICML.

[35]  Deng Cai,et al.  Fast Approximate Nearest Neighbor Search With Navigating Spreading-out Graphs , 2017, ArXiv.

[36]  Douglas A. Reynolds,et al.  Gaussian Mixture Models , 2018, Encyclopedia of Biometrics.

[37]  Forest Baskett,et al.  An Algorithm for Finding Nearest Neighbors , 1975, IEEE Transactions on Computers.

[38]  Ion Stoica,et al.  BlinkDB: queries with bounded errors and bounded response times on very large data , 2012, EuroSys '13.

[39]  Anthony K. H. Tung,et al.  LazyLSH: Approximate Nearest Neighbor Search for Multiple Distance Functions with a Single Index , 2016, SIGMOD Conference.

[40]  Ping Li,et al.  SONG: Approximate Nearest Neighbor Search on GPU , 2020, 2020 IEEE 36th International Conference on Data Engineering (ICDE).

[41]  Nick Koudas,et al.  Approximate Query Processing using Deep Generative Models , 2019, ArXiv.

[42]  Tim Kraska,et al.  The Case for Learned Index Structures , 2018 .

[43]  Xuan Liang,et al.  Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating , 2015, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences.

[44]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[45]  Cyrus Shahabi,et al.  A Class of R*-tree Indexes for Spatial-Visual Search of Geo-tagged Street Images , 2020, 2020 IEEE 36th International Conference on Data Engineering (ICDE).

[46]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching fixed dimensions , 1998, JACM.

[47]  Yasin Abbasi-Yadkori,et al.  Fast Approximate Nearest-Neighbor Search with k-Nearest Neighbor Graph , 2011, IJCAI.

[48]  S T Roweis,et al.  Nonlinear dimensionality reduction by locally linear embedding. , 2000, Science.

[49]  Burr Settles,et al.  Active Learning Literature Survey , 2009 .

[50]  Cordelia Schmid,et al.  Product Quantization for Nearest Neighbor Search , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[52]  Cyrus Shahabi,et al.  ProPolyne: A Fast Wavelet-Based Algorithm for Progressive Evaluation of Polynomial Range-Sum Queries , 2002, EDBT.

[53]  Abdul Wasay,et al.  Learning Data Structure Alchemy , 2019, IEEE Data Eng. Bull..

[54]  Tim Kraska,et al.  SageDB: A Learned Database System , 2019, CIDR.

[55]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[56]  L. Bottou,et al.  Training Invariant Support Vector Machines using Selective Sampling , 2005 .

[57]  Jose Javier Gonzalez Ortiz,et al.  What is the State of Neural Network Pruning? , 2020, MLSys.

[58]  Xing Xie,et al.  Mining Individual Life Pattern Based on Location History , 2009, 2009 Tenth International Conference on Mobile Data Management: Systems, Services and Middleware.

[59]  Christoforos E. Kozyrakis,et al.  Learning Memory Access Patterns , 2018, ICML.