SparkNN: A Distributed In-Memory Data Partitioning for KNN Queries on Big Spatial Data

The growth in GPS-enabled devices and the proliferation of location-based applications have produced an abundance of geotagged (spatial) data. As a consequence, numerous applications have emerged that use this spatial data to provide different types of location-based services. However, the sheer volume of available spatial data challenges the efficiency of these services. Big data frameworks such as Apache Spark process large amounts of data efficiently, but they are designed for general (non-spatial) data: their built-in data partitioning mechanism does not take the spatial proximity of the data into account. Therefore, these frameworks cannot be readily used for spatial analytics such as efficiently answering spatial queries. To fill this gap, this paper proposes SparkNN, an in-memory partitioning and indexing system for answering spatial queries, such as K-nearest neighbor (KNN), on big spatial data. SparkNN is implemented on top of Apache Spark and consists of three layers that together support efficient spatial queries. The first layer is a spatial-aware partitioning layer, which divides the spatial data into partitions such that the load across partitions is balanced and data objects in close proximity are placed in the same, or neighboring, partitions. The second layer is a local indexing layer, which provides a spatial index inside each partition to speed up search within the partition. The third layer is a global index, which is placed in the master node of Spark and routes spatial queries to the relevant partitions. The efficiency of SparkNN was evaluated through extensive experiments on big spatial datasets. The results show that SparkNN significantly outperforms the state-of-the-art Spark system when evaluated on the same set of queries.
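
The interplay of the three layers can be illustrated with a small single-machine sketch. It is not SparkNN's implementation and does not use Spark: it assumes a uniform grid as the spatial partitioner (which, unlike the partitioning described above, does not balance load), a plain scan in place of the local index, and a dictionary keyed by grid cell as a stand-in for the global index. All names (`cell_of`, `build_partitions`, `min_dist_to_cell`, `knn`) and the `cell_size` parameter are hypothetical.

```python
import heapq
import math
from collections import defaultdict


def cell_of(point, cell_size):
    """Map a 2-D point to the grid cell (partition id) that contains it."""
    x, y = point
    return (math.floor(x / cell_size), math.floor(y / cell_size))


def build_partitions(points, cell_size):
    """Partitioning layer (assumed uniform grid): nearby points share a cell."""
    partitions = defaultdict(list)
    for p in points:
        partitions[cell_of(p, cell_size)].append(p)
    return dict(partitions)


def min_dist_to_cell(q, cell, cell_size):
    """Smallest possible distance from q to any location inside the cell."""
    cx, cy = cell
    dx = max(cx * cell_size - q[0], 0.0, q[0] - (cx + 1) * cell_size)
    dy = max(cy * cell_size - q[1], 0.0, q[1] - (cy + 1) * cell_size)
    return math.hypot(dx, dy)


def knn(q, k, partitions, cell_size):
    """Route a KNN query: visit cells nearest-first and stop as soon as
    no remaining cell can hold a point closer than the current k-th best."""
    if k <= 0:
        return []
    # Global-index stand-in: order partition ids by their distance to q.
    order = sorted(partitions, key=lambda c: min_dist_to_cell(q, c, cell_size))
    best = []  # max-heap on distance, stored as (-distance, point)
    for cell in order:
        if len(best) == k and min_dist_to_cell(q, cell, cell_size) >= -best[0][0]:
            break  # every later cell is at least this far away
        # Local-index stand-in: a linear scan of the partition.
        for p in partitions[cell]:
            d = math.hypot(q[0] - p[0], q[1] - p[1])
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
    return sorted((-neg_d, p) for neg_d, p in best)
```

Calling `knn((1.0, 1.0), 3, build_partitions(points, 2.0), 2.0)` returns up to three `(distance, point)` pairs in ascending order of distance. The early exit is what the routing step buys: partitions whose nearest edge lies beyond the current k-th neighbor are never scanned.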
