Row Key Designs of NoSQL Database Tables and Their Impact on Write Performance

In several NoSQL database systems, including HBase, a table has only one index: the row key, which also serves as the clustered index. Secondary indexes are not available out of the box. Row key design is therefore the most important decision when designing tables, because an inappropriate design can severely degrade performance and increase costs. Different row key designs suit different problems, and in this paper we analyze the performance, characteristics, and applicability of each. In particular, we investigate the effect of several techniques for modeling row keys: sequences, salting, padding, hashing, and modulo operations. We propose four designs based on these techniques and analyze their performance on different HBase clusters when loading HDFS files of various sizes. The experiments show that particular designs consistently outperform others on differently sized clusters, both in execution time and in how evenly load is distributed across nodes.
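To make the techniques named above concrete, the following is a minimal sketch of how padding, modulo-based salting, and hash prefixes can be combined into row keys. It is an illustration under stated assumptions, not a reproduction of the four designs evaluated in the paper: the function names, the bucket count `NUM_BUCKETS`, the width `PAD_WIDTH`, and the `-` separator are all hypothetical choices.

```python
import hashlib

# Hypothetical parameters, chosen only for illustration.
NUM_BUCKETS = 8   # number of salt buckets, e.g. one per region
PAD_WIDTH = 10    # fixed width for zero-padded sequence numbers


def padded_key(seq: int) -> str:
    """Zero-pad a sequence number so lexicographic order matches numeric order."""
    return str(seq).zfill(PAD_WIDTH)


def salted_key(seq: int) -> str:
    """Prefix the padded key with a salt bucket derived by a modulo operation.

    Consecutive sequence numbers land in different buckets, so sequential
    writes are spread over several key ranges instead of one.
    """
    bucket = seq % NUM_BUCKETS
    return f"{bucket:02d}-{padded_key(seq)}"


def hashed_key(seq: int) -> str:
    """Prefix the padded key with the first hex characters of an MD5 digest."""
    digest = hashlib.md5(str(seq).encode("utf-8")).hexdigest()
    return f"{digest[:4]}-{padded_key(seq)}"
```

A purely sequential padded key keeps rows in insertion order, which tends to direct all new writes to the same key range. The salted and hashed variants trade that ordering for a wider spread of writes, at the cost of range scans that must visit every prefix.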
