A Bloom Filter-Based Approach for Efficient MapReduce Query Processing on Ordered Datasets

The MapReduce processing framework is unaware of the properties of the underlying datasets. For ordered datasets (e.g., time-series data), in which records have already been sorted, MapReduce still performs unnecessary sorting operations during execution. This directly results in a significant increase in execution time, as sorting a large volume of data is time-consuming. In this paper, we propose a Bloom filter-based approach to improve the performance of MapReduce when processing ordered datasets. In our approach, all records are stored in a set of Bloom filters after the Map phase, and data queries are processed efficiently by checking the Bloom filters. Owing to the high querying efficiency of Bloom filters, we achieve a significant performance gain in the Reduce phase. We conduct a series of experiments to evaluate the effectiveness of the proposed approach. Our experimental results show that it achieves a 2x speedup in query processing performance while also reducing CPU and memory utilization. Moreover, we evaluate the scalability of the proposed approach when processing multiple queries, and observe that the speedup improves further as the number of queries increases.
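
As a rough illustration of the idea of storing mapped records in a set of Bloom filters and answering queries by checking them, the following minimal Python sketch may help. It is not the implementation evaluated in this paper: the class `BloomFilter`, the helpers `build_filters` and `candidate_partitions`, the one-filter-per-contiguous-chunk layout, and the parameters `num_bits` and `num_hashes` are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array probed at k hashed positions."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive k positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def build_filters(mapped_records, num_partitions, num_bits=1 << 16, num_hashes=4):
    """Assumed layout: one filter per contiguous chunk of the ordered map output."""
    filters = [BloomFilter(num_bits, num_hashes) for _ in range(num_partitions)]
    chunk = max(1, -(-len(mapped_records) // num_partitions))  # ceiling division
    for index, (key, _value) in enumerate(mapped_records):
        filters[index // chunk].add(key)
    return filters


def candidate_partitions(filters, key):
    """Return indices of partitions whose filter reports the key as possibly present."""
    return [i for i, f in enumerate(filters) if f.might_contain(key)]


# Hypothetical time-series records that are already sorted by timestamp.
records = [(f"2024-01-01T00:00:{s:02d}", s) for s in range(60)]
filters = build_filters(records, num_partitions=4)
print(candidate_partitions(filters, "2024-01-01T00:00:37"))
```

Because a Bloom filter never yields false negatives, the partition that actually holds a key is always among the candidates; a false positive only costs an extra partition lookup, so a query can skip partitions whose filter rules the key out.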