The enormous amounts of data being generated regularly mean that rapidly accessing relevant data in data stores is just as important as storing it. This study focuses on using a distributed bitmap indexing framework to reduce query execution times in distributed data warehouses. Previous solutions for bitmap indexing at a distributed scale are rigid in their implementation: they use a single compression algorithm and provide their own mechanisms to store, distribute, and retrieve the indices. Users are locked into these implementations even when other alternatives for compression and index storage are available or desirable. We provide an open source, lightweight, and flexible distributed bitmap indexing framework in which the mechanism that searches for keywords to index, the bitmap compression algorithm, and the key-value store that holds the indices are easily interchangeable. Using Roaring bitmaps for compression, HBase as the key-value store, and an updated version of Apache ORC that uses bitmap indices, added to Apache Hive, we demonstrate that, although index creation adds some runtime overhead, searches for hashtags and their combinations in tweets can be greatly accelerated.
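The idea of interchangeable components can be illustrated with a minimal single-process sketch. It is not the framework's actual API: the names `extract_hashtags`, `BitmapIndex`, `add`, and `search_all` are hypothetical, an uncompressed Python integer stands in for a Roaring bitmap, and a plain dictionary stands in for HBase.

```python
import re
from typing import Callable, Dict, Iterable, List


def extract_hashtags(text: str) -> List[str]:
    """Hypothetical keyword extractor: returns lowercased hashtags in the text."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]


class BitmapIndex:
    """Toy bitmap index: one uncompressed bitmap (a Python int) per keyword.

    Assumption: bit i of a keyword's bitmap is set when row i contains it.
    """

    def __init__(self, extractor: Callable[[str], Iterable[str]],
                 store: Dict[str, int]) -> None:
        self.extractor = extractor  # swappable keyword search
        self.store = store          # swappable key-value store (a dict here)

    def add(self, row_id: int, text: str) -> None:
        """Set bit row_id in the bitmap of every keyword found in text."""
        for key in self.extractor(text):
            self.store[key] = self.store.get(key, 0) | (1 << row_id)

    def search_all(self, keys: Iterable[str]) -> List[int]:
        """Return the row ids that contain every key (bitwise AND of bitmaps)."""
        keys = list(keys)
        if not keys:
            return []
        bitmap = self.store.get(keys[0], 0)
        for key in keys[1:]:
            bitmap &= self.store.get(key, 0)
        return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Made-up sample rows, for illustration only.
tweets = [
    "Loading data into #hive with #orc",
    "#HBase as a key-value store",
    "Faster scans: #orc plus #hive indices",
    "Trying #orc on its own",
]

index = BitmapIndex(extract_hashtags, {})
for row_id, tweet in enumerate(tweets):
    index.add(row_id, tweet)

both = index.search_all(["hive", "orc"])  # rows tagged with both hashtags
```

Because the extractor and the store are passed in as arguments, either can be replaced without touching the index logic. Swapping the integer bitmap for a compressed one would follow the same pattern, provided the replacement supports bitwise AND and iteration over set bits.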