Efficient Distributed Range Query Processing in Apache Spark

Range queries are important in many diverse applications. In its simplest one-dimensional form, a range query is expressed by an interval [a, b] on the real line, and the answer consists of all elements that lie in [a, b]. In this work, we focus on efficient range query processing techniques in the Apache Spark engine, which is the state-of-the-art solution for big data management and analytics. We aim to develop a Spark-based indexing scheme that supports range queries in such large-scale decentralized environments and scales well w.r.t. the number of nodes and the number of data items stored. Solutions toward this goal have been proposed in the last few years, but they turn out to be inadequate at the envisaged scale: the classic linear or even logarithmic complexity (for point queries) is still too expensive, and range query processing is even more demanding. In this paper, we go one step further and present a solution with sub-logarithmic complexity. In particular, we present SPIS (SPark-based Interpolation Search), a tree structure that outperforms the existing Spark built-in lookup techniques. We carry out an experimental evaluation using synthetic data sets. Our experimental results demonstrate the efficiency and scalability of the proposed approach.
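
To make the one-dimensional definition concrete, the following is a minimal single-machine sketch of a range query over a sorted list, where the left endpoint is located by plain interpolation search. It is not the SPIS tree structure and does not use Spark; the function names `interpolation_lower_bound` and `range_query` are hypothetical and chosen only for illustration. Interpolation search is known to take O(log log n) probes on average when keys are roughly uniformly distributed, but this simple variant can degrade to linear time on skewed data.

```python
def interpolation_lower_bound(keys, target):
    """Return the first index i with keys[i] >= target, or len(keys) if none.

    Assumes `keys` is sorted in ascending order and holds numeric values.
    """
    if not keys or target <= keys[0]:
        return 0
    if target > keys[-1]:
        return len(keys)

    lo, hi = 0, len(keys) - 1
    # Invariant: keys[lo] < target <= keys[hi], so keys[hi] - keys[lo] > 0.
    while hi - lo > 1:
        span = keys[hi] - keys[lo]
        # Guess the position as if keys were spread evenly between lo and hi.
        pos = lo + int((target - keys[lo]) * (hi - lo) / span)
        # Clamp strictly inside (lo, hi) so every iteration shrinks the gap.
        pos = min(max(pos, lo + 1), hi - 1)
        if keys[pos] < target:
            lo = pos
        else:
            hi = pos
    return hi


def range_query(keys, a, b):
    """Return all elements x of the sorted list `keys` with a <= x <= b."""
    if a > b:
        return []
    start = interpolation_lower_bound(keys, a)
    end = start
    # Scan right from the first qualifying element until b is exceeded.
    while end < len(keys) and keys[end] <= b:
        end += 1
    return keys[start:end]
```

For example, `range_query([1, 3, 4, 7, 9, 12, 15, 20], 4, 12)` returns the elements 4, 7, 9 and 12. The cost splits into locating the left endpoint and then scanning the answer, which is why the point-lookup complexity discussed in the abstract matters for range queries as well.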

[1] Kurt Mehlhorn, et al. Dynamic Interpolation Search, 1985, ICALP.

[2] Mark Handley, et al. A scalable content-addressable network, 2001, SIGCOMM '01.

[3] Michael T. Goodrich, et al. The rainbow skip graph: a fault-tolerant constant-degree distributed data structure, 2006, SODA '06.

[4] Alon Itai, et al. Interpolation search: a log log N search, 1978, CACM.

[5] W. W. Peterson, et al. Addressing for Random-Access Storage, 1957, IBM J. Res. Dev.

[6] Robert Morris, et al. Chord: A scalable peer-to-peer lookup service for internet applications, 2001, SIGCOMM 2001.

[7] Dan E. Willard, et al. Maintaining dense sequential files in a dynamic environment (Extended Abstract), 1982, STOC '82.

[8] Beng Chin Ooi, et al. Speeding up search in peer-to-peer networks with a multi-way tree structure, 2006, SIGMOD Conference.

[9] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[10] Beng Chin Ooi, et al. BATON: A Balanced Tree Structure for Peer-to-Peer Networks, 2005, VLDB.

[11] Muthu Dayalan, et al. MapReduce: Simplified Data Processing on Large Cluster, 2018.

[12] Tom White, et al. Hadoop: The Definitive Guide, 2009.

[13] David R. Karger, et al. Chord: A scalable peer-to-peer lookup service for internet applications, 2001, SIGCOMM '01.

[14] Maha Abdallah, et al. Scalable Range Query Processing for Large-Scale Distributed Database Applications, 2005, IASTED PDCS.