Benchmarking DataStax Enterprise/Cassandra with HiBench

This report evaluates the new analytical capabilities of DataStax Enterprise (DSE) [1] using standard Hadoop workloads. In particular, we run experiments with CPU-bound and I/O-bound micro-benchmarks as well as OLAP-style analytical query workloads. The tests are intended to show that DSE can execute Hadoop applications successfully without adapting them to the underlying Cassandra distributed storage system [2]. Because the Cassandra File System (CFS) [3] supports the Hadoop Distributed File System API, Hadoop stack applications should run seamlessly in DSE. The report is structured as follows: Section 2 briefly describes the technologies involved in our study. Section 3 gives an overview of the hardware and software components of the experimental environment. Section 4 defines our benchmark methodology. Section 5 presents the experiments and evaluates their results. Finally, Section 6 concludes with lessons learned.

[1] Jignesh M. Patel, et al. Big data and its technical challenges, 2014, CACM.

[2] Zheng Shao, et al. Hive - a petabyte scale data warehouse using Hadoop, 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[3] Michael Stonebraker, et al. A comparison of approaches to large-scale data analysis, 2009, SIGMOD Conference.

[4] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[5] Yonggang Wen, et al. Toward Scalable Systems for Big Data Analytics: A Technology Tutorial, 2014, IEEE Access.

[6] Todor Ivanov, et al. On the inequality of the 3V's of Big Data Architectural Paradigms: A case for heterogeneity, 2013, ArXiv.

[7] Rick Cattell, et al. Scalable SQL and NoSQL data stores, 2011, SIGMOD Record.

[8] Benjamin Reed, et al. Building a high-level dataflow system on top of Map-Reduce, 2009, VLDB 2009.

[9] Madhusudhan Govindaraju, et al. An Evaluation of Cassandra for Hadoop, 2013, 2013 IEEE Sixth International Conference on Cloud Computing.

[10] Prashant Malik, et al. Cassandra: a decentralized structured storage system, 2010, ACM SIGOPS Operating Systems Review.

[11] Eric Sammer. Hadoop Operations, 2013.

[12] Lars George, et al. HBase: The Definitive Guide, 2011.

[13] Kathleen Ting, et al. Apache Sqoop Cookbook, 2013.

[14] J. Manyika. Big data: The next frontier for innovation, competition, and productivity, 2011.

[15] Hairong Kuang, et al. The Hadoop Distributed File System, 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[16] Jie Huang, et al. The HiBench benchmark suite: Characterization of the MapReduce-based data analysis, 2010, 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010).