Efficient Retrieval of Top-K Most Similar Users from Travel Smart Card Data

Understanding the dynamics of daily human mobility patterns is essential for the management and planning of urban facilities and services. Travel smart cards, which record users' public transport histories, capture rich information about users' mobility patterns. This provides an opportunity to discover valuable knowledge from these transaction records. In recent years, research on measuring user similarity for behavior analysis has attracted considerable attention in applications such as recommendation systems, crowd behavior analysis, and numerous data mining tasks. In this paper, our goal is to estimate the similarity between users' travel patterns from their travel smart card data. The core of our proposal is a novel user similarity measure, Travel Spatial-Temporal Similarity (TST), which measures the spatial range and temporal similarity between users. We also propose a hybrid index structure, which integrates inverted files and cluster-based partitioning, to allow efficient retrieval of the top-K most similar users. Experimental evaluation shows that the proposed approach delivers scalable performance.
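
The retrieval idea can be illustrated with a minimal sketch: an inverted file from stations to users narrows the candidate set, and a heap then selects the top-K candidates by score. Everything specific here is an assumption made for illustration. The abstract does not define TST, so the `similarity` function below is a stand-in (a weighted mix of Jaccard overlap on visited stations and on travel hours), and the record layout, function names, and `alpha` weight are hypothetical. The cluster-based partitioning half of the hybrid index is omitted.

```python
import heapq
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(u, v, alpha=0.5):
    """Stand-in score (NOT the paper's TST): station overlap mixed with hour overlap."""
    spatial = jaccard(u["stations"], v["stations"])
    temporal = jaccard(u["hours"], v["hours"])
    return alpha * spatial + (1 - alpha) * temporal


def build_inverted_index(users):
    """Map each station to the set of user ids that have visited it."""
    index = defaultdict(set)
    for uid, record in users.items():
        for station in record["stations"]:
            index[station].add(uid)
    return index


def top_k_similar(query_id, users, index, k=2):
    """Return up to k (score, user_id) pairs, best first, among users sharing a station."""
    query = users[query_id]
    candidates = set()
    for station in query["stations"]:
        candidates |= index[station]
    candidates.discard(query_id)
    scored = ((similarity(query, users[c]), c) for c in candidates)
    return heapq.nlargest(k, scored)


# Hypothetical toy records: stations visited and hours of day with trips.
users = {
    "u1": {"stations": {"A", "B", "C"}, "hours": {8, 18}},
    "u2": {"stations": {"A", "B"}, "hours": {8, 18}},
    "u3": {"stations": {"C", "D"}, "hours": {9}},
    "u4": {"stations": {"E"}, "hours": {8, 18}},
}
index = build_inverted_index(users)
result = top_k_similar("u1", users, index, k=2)
print(result)
```

In this sketch, users who share no station with the query (such as `u4`) are never scored, even though the stand-in score could give them a nonzero temporal component. That is a simplification of the filtering step, not a property claimed for the paper's index.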
