Scalable Data Management: An In-Depth Tutorial on NoSQL Data Stores

The unprecedented scale at which data is consumed and generated today has created a large demand for scalable data management and given rise to non-relational, distributed “NoSQL” database systems. Two central problems triggered this development: 1) vast amounts of user-generated content in modern applications, with the resulting request loads and data volumes, and 2) the desire of the developer community to employ problem-specific data models for storage and querying. To address these needs, various data stores have been developed by both industry and research, arguing that the era of one-size-fits-all database systems is over. The heterogeneity and sheer number of these systems, now commonly referred to as NoSQL data stores, make it increasingly difficult to select the most appropriate system for a given application. Therefore, these systems are frequently combined in polyglot persistence architectures to leverage each system in its respective sweet spot. This tutorial gives an in-depth survey of the most relevant NoSQL databases to provide a comparative classification and highlight open challenges. To this end, we analyze the approach of each system to derive its scalability, availability, consistency, data modeling and querying characteristics. We present how each system’s design is governed by a central set of trade-offs over irreconcilable system properties. We then cover recent research results in distributed data management to illustrate that some shortcomings of NoSQL systems could already be solved in practice, whereas other NoSQL data management problems pose interesting and unsolved research challenges. Going beyond earlier tutorials, we explicitly address how the quickly emerging topic of processing and storing massive amounts of data in real time can be handled by different types of real-time data management systems.
