Efficient Distributed Range Query Processing in Apache Spark

Range queries are important in many diverse applications. In its simplest one-dimensional form, a range query is expressed by an interval [a, b] on the real line, and the answer consists of all stored elements x with a ≤ x ≤ b. In this work, we focus on efficient range query processing in the Apache Spark engine, a state-of-the-art solution for big data management and analytics. We aim to develop a Spark-based indexing scheme that supports range queries in such large-scale decentralized environments and scales well with respect to the number of nodes and the number of data items stored. Several solutions toward this goal have been proposed in the last few years; however, they turn out to be inadequate at the envisaged scale, since linear or even logarithmic complexity (for point queries) is still too expensive, and range query processing is even more demanding. In this paper, we go one step further and present a solution with sub-logarithmic complexity. In particular, we present SPIS (SPark-based Interpolation Search), a tree structure that outperforms the existing Spark built-in lookup techniques. We carry out an experimental evaluation using synthetic data sets. Our experimental results demonstrate the efficiency and scalability of the proposed approach.
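
To make the one-dimensional query concrete, the following is a minimal single-machine sketch of answering a range query [a, b] over a sorted in-memory list with textbook interpolation search. It is not the SPIS tree and uses no Spark API; the function names interpolation_lower_bound and range_query are hypothetical and introduced only for illustration. Interpolation search is known to need O(log log n) expected probes when keys are uniformly distributed, which is the kind of sub-logarithmic behavior referred to above.

```python
from typing import List, Sequence


def interpolation_lower_bound(keys: Sequence[float], target: float) -> int:
    """Return the first index i with keys[i] >= target (len(keys) if none).

    Assumes `keys` is sorted in non-decreasing order.
    """
    lo, hi = 0, len(keys) - 1
    # Invariant: every index < lo holds a key < target,
    # and every index > hi holds a key >= target.
    while lo <= hi:
        if target <= keys[lo]:
            return lo
        if target > keys[hi]:
            return hi + 1
        # Here keys[lo] < target <= keys[hi], so the denominator is positive.
        frac = (target - keys[lo]) / (keys[hi] - keys[lo])
        # Probe where the target would sit if keys were evenly spread.
        pos = min(hi, lo + int(frac * (hi - lo)))
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return lo


def range_query(keys: Sequence[float], a: float, b: float) -> List[float]:
    """Return all keys x with a <= x <= b, in sorted order."""
    start = interpolation_lower_bound(keys, a)
    end = start
    # Scan forward from the first qualifying key until the range is left.
    while end < len(keys) and keys[end] <= b:
        end += 1
    return list(keys[start:end])


if __name__ == "__main__":
    data = [1, 3, 4, 7, 9, 12, 15]
    print(range_query(data, 4, 12))
```

In this sketch the search locates only the left endpoint, and the answer is then collected by a linear scan, so the total cost is the search cost plus the size of the answer.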