As the Web continues to grow, more and more services are being created that require data persistence. The amount of data that these services need to store is growing at an exponential rate, and so is the number of requests they have to serve. The relationships between data items are also becoming more numerous. In the past, data persistence problems were consistently solved with relational databases. However, the requirements of these new services, and their need for scalability, have pushed relational database technologies to their limits. New approaches to these problems have been developed, and the NoSQL movement was formed. This movement fosters the creation of new non-relational databases, specialized for different problem domains, with the intent of using the “right tool for the job”. Besides being non-relational, these databases share other characteristics, such as being distributed, trading consistency for availability, and providing easy ways to scale horizontally. As these technologies flourish, there is a perceived knowledge gap that stems from the paradigm shift they introduce, which prevents developers from leveraging the existing body of knowledge associated with the traditional relational approach. This work aims to fill this knowledge gap by studying the available non-relational databases in order to develop a systematic approach for solving data persistence problems with these technologies. The state of the art of non-relational databases was researched, and several NoSQL databases were categorized according to their consistency, data model, replication and querying capabilities. A benchmarking framework was introduced in order to assess the performance of NoSQL databases as well as their scalability and elasticity properties.
A core set of benchmarks was defined and results are reported for three widely used systems: Cassandra, Riak and a simple sharded MySQL implementation, which serves as a baseline. Data modeling with NoSQL was further researched, and this study provides a simple methodology for modeling data in a non-relational database, as well as a set of common design patterns. This part of the study focused mainly on Cassandra and Riak. Additionally, two prototypes, using Riak and Cassandra, were implemented, which model a small part of a telecommunications operator’s business. These prototypes relied on the methodology and design patterns described earlier and were used as a proof of concept. Their performance was put to the test by benchmarking a set of common (and usually expensive) operations against a traditional relational implementation. Both Cassandra and Riak yielded good results when compared to the relational implementation used as a baseline. They also proved to be easily scalable and elastic. Cassandra, specifically, achieved significantly better results for write operations than the other systems. The design patterns developed proved useful when implementing the prototypes, and it is expected that this work will make it easier to adopt a NoSQL database.