Towards Auto-Generated Data Systems

After decades of progress, database management systems (DBMSs) are now the backbone of many data applications that we interact with daily. Yet with the emergence of new data types and hardware, building and optimizing new data systems remains as difficult as it was in the heyday of relational databases. In this paper, we summarize our work towards automating the building and optimization of data systems. Drawing from our own experience, we further argue that any automation technique must address three aspects: user specification, code generation, and result validation. We conclude by discussing a case study on video data processing, along with opportunities for future research towards designing data systems that are automatically generated.
