Paper Information - A Bloom Filter-Based Approach for Efficient MapReduce Query Processing on Ordered Datasets

The MapReduce processing framework is unaware of the properties of the underlying datasets. For ordered datasets (e.g., time-series data), in which records are already sorted, MapReduce still performs unnecessary sorting operations during execution. This significantly increases execution time, because sorting a large volume of data is time-consuming. In this paper, we propose a Bloom filter-based approach to improve the performance of MapReduce when processing ordered datasets. In our approach, all records are stored in a set of Bloom filters after the Map phase, and data queries are processed efficiently by checking the Bloom filters. Owing to the high query efficiency of Bloom filters, we achieve a significant performance gain in the Reduce phase. We conduct a series of experiments to evaluate the effectiveness of the proposed approach. The experimental results show that it achieves a 2x speedup in query processing performance while also reducing CPU and memory utilization. We also evaluate the scalability of the approach when processing multiple queries and observe that the speedup improves further as the number of queries increases.
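
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: records emitted by mappers are inserted into a set of Bloom filters, and a query checks those filters to decide which partitions may hold a key. The class `BloomFilter`, the helpers `build_filters` and `candidate_partitions`, the one-filter-per-partition layout, and the parameters `num_bits` and `num_hashes` are all assumptions made for illustration, not the authors' design.

```python
import hashlib


class BloomFilter:
    """A basic Bloom filter using double hashing over a SHA-256 digest.

    Illustrative only: sizes and hash scheme are assumptions, not the paper's.
    """

    def __init__(self, num_bits=1 << 16, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(key)
        )


def build_filters(partitions):
    """Build one filter per partition of mapper output (hypothetical layout)."""
    filters = []
    for records in partitions:
        bf = BloomFilter()
        for key in records:
            bf.add(key)
        filters.append(bf)
    return filters


def candidate_partitions(filters, key):
    """Return indices of partitions whose filter may contain the key."""
    return [i for i, bf in enumerate(filters) if key in bf]


# Ordered (time-series) keys split into two partitions.
partitions = [
    ["2013-01-01", "2013-01-02"],
    ["2013-01-03", "2013-01-04"],
]
filters = build_filters(partitions)
hits = candidate_partitions(filters, "2013-01-03")
```

A Bloom filter never reports a stored key as absent, so `hits` always includes partition 1 here. It can occasionally report a key that was never stored, so a real system would still verify candidate partitions against the actual records.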
