Self-adapted mixture distance measure for clustering uncertain data

The distance measure plays an important role in clustering uncertain data. However, existing distance measures for uncertain data each have a blind spot. Geometric distance measures cannot tell apart uncertain objects that have different distributions but heavily overlap in location. Probability distribution distance measures cannot distinguish between different pairs of completely separated uncertain objects, however far apart the objects in each pair are. In this paper, we propose a self-adapted mixture distance measure for clustering uncertain data that considers the geometric distance and the probability distribution distance simultaneously, thus overcoming the issues of previous distance measures. The proposed distance measure consists of three parts: (1) the induced kernel distance, which measures the geometric distance between uncertain objects; (2) the Jensen-Shannon divergence, which measures the probability distribution distance between uncertain objects; and (3) the self-adapted weight parameter, which adjusts the relative importance of the induced kernel distance and the Jensen-Shannon divergence according to the location overlap information of the dataset. The proposed distance measure is symmetric, finite, and parameter adaptive. Furthermore, we integrate the self-adapted mixture distance measure into partition-based and density-based algorithms for clustering uncertain data. Extensive experimental results on synthetic datasets, real benchmark datasets, and real-world uncertain datasets show that the proposed distance measure outperforms existing distance measures for clustering uncertain data.
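The three-part construction described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's formulation; it rests on several assumptions: each uncertain object is represented by a finite set of samples, the induced kernel distance is taken to be the distance between Gaussian-kernel mean embeddings, the Jensen-Shannon divergence is estimated from histograms on a shared grid, and the two parts are combined as a convex combination. The function names, the parameters `sigma` and `bins`, and the weight `w` are hypothetical; in particular, `w` is passed in by hand here, whereas the paper derives it from the location overlap of the dataset by a rule the abstract does not spell out.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Pairwise Gaussian kernel between rows of a (n, d) and b (m, d)."""
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def induced_kernel_distance(x, y, sigma=1.0):
    """Distance between the kernel mean embeddings of two sample sets
    (assumed stand-in for the paper's induced kernel distance)."""
    d2 = (gaussian_kernel(x, x, sigma).mean()
          + gaussian_kernel(y, y, sigma).mean()
          - 2.0 * gaussian_kernel(x, y, sigma).mean())
    return float(np.sqrt(max(d2, 0.0)))

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so the value lies in [0, 1])."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

def js_between_samples(x, y, bins=10):
    """Histogram both sample sets on one shared grid, then compare them."""
    lo = np.minimum(x.min(axis=0), y.min(axis=0))
    hi = np.maximum(x.max(axis=0), y.max(axis=0))
    grid = list(zip(lo, hi))
    hx, _ = np.histogramdd(x, bins=bins, range=grid)
    hy, _ = np.histogramdd(y, bins=bins, range=grid)
    return js_divergence(hx.ravel(), hy.ravel())

def mixture_distance(x, y, w, sigma=1.0, bins=10):
    """Assumed convex combination: w weights the distribution part,
    (1 - w) the geometric part. w is supplied by the caller here."""
    return (w * js_between_samples(x, y, bins)
            + (1.0 - w) * induced_kernel_distance(x, y, sigma))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(0.0, 0.1, size=(200, 2))
    b = rng.normal(2.0, 0.1, size=(200, 2))
    c = rng.normal(4.0, 0.1, size=(200, 2))
    print("JS(a, b):", js_between_samples(a, b))
    print("JS(a, c):", js_between_samples(a, c))
    print("mixture(a, b):", mixture_distance(a, b, w=0.5))
    print("mixture(a, c):", mixture_distance(a, c, w=0.5))
```

In this toy setup the objects `a`, `b`, and `c` are completely separated, so the histogram-based divergence saturates for both pairs and cannot rank them, while the kernel term still separates the nearer pair from the farther one. That is the failure mode of pure distribution distances the abstract describes, and the reason for mixing the two terms.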
