MR-Cubes: On-the-Fly Computation of Location Popularity from Check-in Data Streams

Several applications in urban planning, ride-sharing, and marketing require access to the location popularity of a geographical area (e.g., city block, city, county) in near real time and at different resolutions. To picture such access, imagine a visualization tool that renders a heatmap of a region's location popularity on the fly while a user zooms in and out seamlessly. The access method behind such a visualization must support: 1) updating the heatmap cells frequently as raw data (e.g., check-ins) arrives at a high rate in a streaming fashion, and 2) splitting and merging adjacent cells quickly to support zooming in and out, respectively. This is challenging because the most useful metric for location popularity, location entropy, requires counting the visits of each unique user, and hence: 1) a large data structure must be maintained and updated per cell, and 2) adjacent cells must be aggregated and disaggregated quickly even though unique visits are not additive. Because of these challenges, previous techniques for OLAP cubes, streaming sketches, and index structures are not effective. In this paper, we propose a new index structure called MR-Cube that approximates popularity by maintaining sketches of the streamed data per cell, supports time decay for older visits, and aggregates the non-additive location popularity quickly and accurately at different resolutions. We evaluate the accuracy and efficiency of MR-Cube using real-world and synthetic datasets and show its utility for our application.
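To make the non-additivity concrete, the following is a minimal sketch of exact (not sketched) location entropy, assuming the common definition H(l) = -Σ_u p_u ln p_u, where p_u is the fraction of visits to cell l made by user u. The function names, user identifiers, and visit counts are hypothetical and chosen for illustration only; this is not the MR-Cube structure itself, which approximates these quantities with per-cell sketches.

```python
import math
from collections import Counter


def location_entropy(visits_per_user):
    """Exact location entropy of one cell from its per-user visit counts."""
    total = sum(visits_per_user.values())
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log(count / total)
        for count in visits_per_user.values()
        if count > 0
    )


def merge_cells(*cells):
    """Merge adjacent cells (zoom out) by adding their per-user visit counts."""
    merged = Counter()
    for cell in cells:
        merged.update(cell)
    return merged


# Hypothetical check-in counts for two adjacent cells.
cell_a = Counter({"u1": 2, "u2": 2})  # two users, evenly split
cell_b = Counter({"u1": 4})           # a single user

merged = merge_cells(cell_a, cell_b)

h_a = location_entropy(cell_a)
h_b = location_entropy(cell_b)
h_merged = location_entropy(merged)

# The merged entropy is not the sum of the two cell entropies.
print(h_a, h_b, h_merged)
```

In this toy case the entropy of the merged cell cannot be derived from the two cell entropies alone: the full per-user counts are needed, which is why an exact approach has to keep a large structure per cell and why an approximate, mergeable summary is attractive.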
