Cinderella — Adaptive online partitioning of irregularly structured data

In an increasing number of use cases, databases face the challenge of managing irregularly structured data. Irregularly structured data is characterized by a quickly evolving variety of entities without a common set of attributes. These entities do not show enough regularity to be captured in a traditional database schema. A common solution is to centralize the diverse entities in a universal table. Usually, this leads to a very sparse table. Although today's techniques allow efficient storage of sparse universal tables, query efficiency is still a problem. Queries that reference only a subset of attributes have to read the whole universal table including many irrelevant entities. One possible solution is to use a partitioning of the table, which allows pruning partitions of irrelevant entities before they are touched. Creating and maintaining such a partitioning manually is very laborious or even infeasible, due to the enormous complexity. Thus an autonomous solution is desirable. In this paper, we define the Online Partitioning Problem for irregularly structured data and present Cinderella. Cinderella is an autonomous online algorithm for horizontal partitioning of irregularly structured entities in universal tables. It is designed to keep its overhead low by incrementally assigning entities to partitions while they are touched anyway during modifications. The achieved partitioning allows queries that retrieve only entities with a subset of attributes easily pruning partitions of irrelevant entities. Cinderella increases the locality of queries and reduces query execution cost.

[1]  Jeffrey Naughton,et al.  The case for a wide-table approach to manage sparse relational data sets , 2007, SIGMOD '07.

[2]  Prashant Malik,et al.  Cassandra: a decentralized structured storage system , 2010, OPSR.

[3]  Mitesh Patel,et al.  Structured databases on the web: observations and implications , 2004, SGMD.

[4]  Rakesh Agrawal,et al.  Storage and Querying of E-Commerce Data , 2001, VLDB.

[5]  Jens Lehmann,et al.  DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia , 2015, Semantic Web.

[6]  Chun Zhang,et al.  Automating physical database design in a parallel database , 2002, SIGMOD '02.

[7]  Wilson C. Hsieh,et al.  Bigtable: A Distributed Storage System for Structured Data , 2006, TOCS.

[8]  Jeffrey F. Naughton,et al.  Extending RDBMSs To Support Sparse Datasets Using An Interpreted Attribute Storage Format , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[9]  Ke Wang,et al.  Discovering Structural Association of Semistructured Data , 2000, IEEE Trans. Knowl. Data Eng..

[10]  Jennifer Widom,et al.  Object exchange across heterogeneous information sources , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[11]  Vivek R. Narasayya,et al.  Integrating vertical and horizontal partitioning into automated physical database design , 2004, SIGMOD '04.

[12]  Serge Abiteboul,et al.  Extracting schema from semistructured data , 1998, SIGMOD '98.

[13]  Srini Acharya,et al.  Relational support for flexible schema scenarios , 2008, Proc. VLDB Endow..

[14]  Sam Lightstone,et al.  DB2 Design Advisor: Integrated Automatic Physical Database Design , 2004, VLDB.

[15]  Alekh Jindal,et al.  Relax and Let the Database Do the Partitioning Online , 2011, BIRTE.

[16]  Werner Vogels,et al.  Dynamo: amazon's highly available key-value store , 2007, SOSP.