Prepared scan: efficient retrieval of structured data from HBase

Because NoSQL systems scale better than traditional relational databases, many applications migrate their data to them even when they have no need for the schema flexibility these systems provide. That flexibility makes accessing structured data costly, however, consuming substantial network bandwidth and processing time. In this paper, we analyse this cost in Apache HBase and propose a new scan operation, named Prepared Scan, that optimizes access to regularly structured data by taking advantage of a schema that the application knows in advance. Using an industry-standard benchmark, we show that Prepared Scan improves throughput by up to 29% and decreases network bandwidth consumption by up to 20%.
