AsterixDB Mid-Flight: A Case Study in Building Systems in Academia

Building large software systems is always a challenging venture, but it is especially so in academia. This paper describes the experiences that the author and his (mostly UC-based) partners in software crime have had that culminated in the Big Data Management System now available as Apache AsterixDB. It covers a mix of the history and technical content of the nearly ten-year-old project, starting with its inception during the MapReduce craze. It describes the phases that the effort has gone through and some of the lessons learned along the way. The paper also covers some personal reflections and opinions about the challenges of systems-building, as well as writing about it, in our current academic culture. Included is the case for doing this sort of work at all – discussing the pitfalls of doing "systems" research in the absence of an actual system, and why the gain outweighs the pain of building and sharing database software in academia. As of late 2018, Apache AsterixDB is also having a commercial impact as the storage and parallel query engine underlying a new offering called Couchbase Analytics. The last part of the paper explains how we are attempting to balance the uses of AsterixDB as (i) a generally available open source Apache software platform, (ii) an end-to-end research testbed for universities, and (iii) the technology powering a commercial NoSQL product.
