Top-k spatial-keyword publish/subscribe over sliding window

With the prevalence of social media and GPS-enabled devices, massive amounts of geo-textual data are being generated in a streaming fashion, leading to a variety of applications such as location-based recommendation and information dissemination. In this paper, we investigate a novel real-time top-k monitoring problem over a sliding window of streaming data; that is, we continuously maintain the top-k most relevant geo-textual messages (e.g., geo-tagged tweets) for a large number of spatial-keyword subscriptions (e.g., registered users interested in local events) simultaneously. To provide the most recent information under controllable memory cost, a sliding window model is employed on the streaming geo-textual data. To the best of our knowledge, this is the first work to study top-k spatial-keyword publish/subscribe over a sliding window. A novel centralized system, called Skype (Top-k Spatial-keyword Publish/Subscribe), is proposed in this paper. In Skype, to continuously maintain top-k results for massive numbers of subscriptions, we devise a novel indexing structure over subscriptions such that each incoming message can be delivered immediately on its arrival. Because message expiration triggers expensive top-k re-evaluations, we develop a novel cost-based k-skyband technique that reduces the number of re-evaluations in a cost-effective way. Extensive experiments verify the efficiency and effectiveness of our proposed techniques. Furthermore, to support better scalability and higher throughput, we propose a distributed version of Skype, namely DSkype, on top of Storm, a popular distributed stream processing system. With the help of fine-tuned subscription/message distribution mechanisms, DSkype can achieve orders-of-magnitude speed-up over its centralized version.
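To make the role of a k-skyband concrete, the following is a minimal sketch of the plain (not cost-based) k-skyband idea for a single subscription over a count-based sliding window. It is an illustration under stated assumptions, not the Skype implementation: the class name `SkybandTopK`, the count-based window, and the pre-computed numeric `score` (standing in for a combined spatial and textual relevance) are all hypothetical. A message is dropped once at least k newer messages score at least as high, since those messages outlive it and it can never re-enter the top-k; the cost-based tuning described in the abstract is not modelled here.

```python
class SkybandTopK:
    """Top-k over a count-based sliding window for ONE subscription.

    Assumption: each message arrives with a pre-computed relevance score.
    Only the plain k-skyband is kept; no cost model is applied.
    """

    def __init__(self, k, window):
        self.k = k
        self.window = window   # number of most recent messages that are alive
        self.clock = 0         # arrival counter
        self.buf = []          # entries [arrival, score, dominance_count]

    def insert(self, score):
        self.clock += 1
        # Expire messages that fell out of the window.
        cutoff = self.clock - self.window
        alive = [e for e in self.buf if e[0] > cutoff]
        # The new message dominates every older message whose score is not higher.
        kept = []
        for e in alive:
            if e[1] <= score:
                e[2] += 1
            # Keep only messages dominated by fewer than k newer ones.
            if e[2] < self.k:
                kept.append(e)
        kept.append([self.clock, score, 0])
        self.buf = kept

    def topk(self):
        # The true top-k of the window is always contained in the skyband.
        return sorted((e[1] for e in self.buf), reverse=True)[: self.k]


def brute_force_topk(scores, k, window):
    """Reference answer: sort the last `window` scores and take the best k."""
    return sorted(scores[-window:], reverse=True)[:k]
```

For example, with `k=2` and `window=3`, feeding the scores 5, 3, 4, 1, 2 leaves `topk()` equal to `[4, 2]`, which matches sorting the last three scores directly. Because the buffer holds only skyband members, an expiration never forces a scan of the whole window in this sketch.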
