An Evaluation of Cassandra for Hadoop

Over the last decade, the growth of social media, unconventional web technologies, and mobile applications has encouraged the development of a new breed of database models. NoSQL data stores target unstructured data, which is dynamic by nature and a key focus area for "Big Data" research. Such data can prove costly and impractical to administer with SQL databases because it lacks structure and demands high scalability and elasticity. NoSQL data stores such as MongoDB and Cassandra provide a desirable platform for fast and efficient data queries, which has increased their importance in areas such as cloud applications, e-commerce, social media, bioinformatics, and materials science. In an effort to combine the querying capabilities of conventional database systems with the processing power of the MapReduce model, this paper presents a thorough evaluation of the Cassandra NoSQL database when used in conjunction with the Hadoop MapReduce engine. We characterize performance for a wide range of representative use cases, and then compare, contrast, and evaluate the results so that application developers can make informed decisions based on data size, cluster size, replication factor, and partitioning strategy to meet their performance needs.
