Parallel clustering of big data of spatio-temporal trajectory

The computing efficiency of many spatial data analysis algorithms declines sharply as data size increases. Introducing a distributed parallel computing model is therefore a meaningful way to extend spatial data analysis methods and improve their computational efficiency. Spatio-temporal trajectory data are massive, time-dependent, and dynamic. Considering these features, we propose a fast method for calculating trajectory similarity based on coarse-grained Dynamic Time Warping (DTW). The algorithm greatly reduces computation time when the trajectory sequences are very long. We also propose a parallel trajectory clustering strategy for big data under the Hadoop MapReduce model. The trajectory data are sliced, and the trajectory similarity calculation and the iterative computation of cluster centers are carried out simultaneously by multiple worker nodes. The parallel trajectory clustering, built on the open-source project Mahout, was applied to vehicle trajectory data, and the experimental results show that the clustering results are valid. The computing performance of the parallel clustering improves markedly as the trajectory data size increases, and the new parallel clustering method outperforms traditional algorithms such as k-means.
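The idea of coarse-grained DTW can be illustrated with a minimal Python sketch. The abstract does not specify how the coarse-graining is performed, so the sketch assumes one simple scheme: each trajectory is shortened by averaging consecutive points in fixed-size blocks, and standard DTW is then run on the shortened sequences. The function names `coarsen`, `dtw`, and `coarse_dtw` and the `window` parameter are hypothetical and are not taken from the paper.

```python
import math


def coarsen(traj, window):
    """Average consecutive (x, y) points in blocks of `window` points.

    Assumed coarse-graining scheme, for illustration only.
    """
    out = []
    for i in range(0, len(traj), window):
        block = traj[i:i + window]
        out.append((sum(p[0] for p in block) / len(block),
                    sum(p[1] for p in block) / len(block)))
    return out


def dtw(a, b):
    """Standard DTW distance between two point sequences (O(len(a) * len(b)))."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean point distance
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]


def coarse_dtw(a, b, window=4):
    """DTW on the coarsened sequences; the DTW table shrinks by about window**2."""
    return dtw(coarsen(a, window), coarsen(b, window))
```

Under this assumption, the DTW table for two trajectories of length n has roughly (n / window)^2 cells instead of n^2, which is where the time saving for long sequences would come from. The result is an approximation of the full DTW distance, not the same value.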
