HASTE: A Distributed System for Hybrid and Adaptive Processing on Streaming Spatial-Textual Data

Streaming spatial-textual data, i.e., data that carries both geographic and textual information such as geo-tagged tweets, is growing at an unprecedented rate. Continuous spatial-textual queries, a basic operation that continuously retrieves real-time results from large-scale spatial-textual streams, call for efficient distributed processing. However, existing proposals are either spatial-aware only or exploit textual information only superficially for pruning. We propose a distributed system, called HASTE, for hybrid and adaptive processing on streaming spatial-textual data. The novelty lies in three aspects: (1) we propose a novel method that reduces the workload beforehand by dividing objects and queries into mutually exclusive types; (2) we develop a novel load partitioning strategy and a novel cost model that consider both spatial and textual properties; (3) we design a multi-level load adjustment strategy that adaptively copes with different degrees of load imbalance. We report on extensive experiments with real-world data that offer insight into the performance of the solution and show that it is capable of outperforming state-of-the-art proposals.
