An LSM-based tuple compaction framework for Apache AsterixDB

Document database systems store self-describing semi-structured records, such as JSON, "as-is" without requiring users to pre-define a schema. This gives users the flexibility to change the structure of incoming records without taking the system offline or hindering the performance of currently running queries. However, this flexibility is not free. Because every record carries its own field names and type information, the resulting redundancy can introduce unnecessary storage overhead and degrade query performance. Our focus in this paper is to address the storage overhead issue by introducing a tuple compactor framework that infers and extracts the schema from self-describing semi-structured records during data ingestion. Because many prominent document stores, such as MongoDB and Couchbase, adopt Log-Structured Merge (LSM) trees in their storage engines, our framework exploits LSM lifecycle events to piggyback the schema inference and extraction operations. We have implemented and empirically evaluated our approach to measure its impact on storage, data ingestion, and query performance in the context of Apache AsterixDB.
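As a rough illustration of the idea of inferring a schema while records are flushed, the following minimal sketch assigns a stable integer ID to each field name the first time it is seen and rewrites the records in a batch to use those IDs instead of repeated field names. It is not the paper's algorithm or AsterixDB's API: the function name `flush_and_compact`, the in-memory batch, and the flat name-to-ID dictionary are all assumptions made for this example, and nested values are left untouched.

```python
from typing import Any


def flush_and_compact(
    batch: list[dict[str, Any]],
    field_ids: dict[str, int],
) -> list[dict[int, Any]]:
    """Hypothetical flush hook: infer field IDs and compact a batch of records.

    `batch` stands in for the records held in an in-memory LSM component.
    `field_ids` is the running schema (field name -> integer ID); it is
    updated in place so IDs stay stable across successive flushes.
    """
    compacted = []
    for record in batch:
        row = {}
        for name, value in record.items():
            # Assign the next free ID the first time a field name appears.
            fid = field_ids.setdefault(name, len(field_ids))
            row[fid] = value
        compacted.append(row)
    return compacted


# Usage: two flushes sharing one schema dictionary.
schema: dict[str, int] = {}
first = flush_and_compact([{"id": 1, "name": "a"}, {"id": 2, "age": 30}], schema)
second = flush_and_compact([{"age": 5, "city": "x"}], schema)
```

In this sketch the field names are stored once in `schema` rather than in every record, which is the source of the storage saving the abstract describes; a real system would also have to track value types, nested structures, and schema persistence.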
