A Hierarchical Algorithm for Clustering Uncertain Data via an Information-Theoretic Approach

In recent years there has been a growing interest in clustering uncertain data. In contrast to traditional, "sharp" data representation models, uncertain data objects can be represented in terms of an uncertainty region over which a probability density function (pdf) is defined. In this context, the focus has been mainly on partitional and density-based approaches, whereas hierarchical clustering schemes have drawn less attention. We propose a centroid-linkage-based agglomerative hierarchical algorithm for clustering uncertain objects, named U-AHC. The cluster merging criterion is based on an information-theoretic measure to compute the distance between cluster prototypes. These prototypes are represented as mixture densities that summarize the pdfs of all the uncertain objects in the clusters. Experiments have shown that our method outperforms state-of-the-art clustering algorithms from an accuracy viewpoint while achieving reasonably good efficiency.
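
The merging scheme described above can be illustrated with a minimal sketch. It is not the U-AHC implementation: the abstract does not name the information-theoretic measure, so the Bhattacharyya distance is used here purely as an assumed stand-in. The one-dimensional Gaussian objects, the discretization grid, the equal mixture weights, and all function names are likewise hypothetical choices made for illustration.

```python
import math

def gaussian_pdf_on_grid(mean, std, grid):
    """Discretize a 1-D Gaussian pdf on a grid and normalize it to sum to 1."""
    weights = [math.exp(-0.5 * ((x - mean) / std) ** 2) for x in grid]
    total = sum(weights)
    return [w / total for w in weights]

def mixture(pdfs):
    """Equal-weight mixture of discretized pdfs (assumed form of the prototype)."""
    n = len(pdfs)
    return [sum(column) / n for column in zip(*pdfs)]

def bhattacharyya_distance(p, q):
    """Negative log of the Bhattacharyya coefficient (assumed stand-in measure)."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return -math.log(max(coefficient, 1e-300))  # guard against log(0)

def agglomerate(pdfs, n_clusters):
    """Repeatedly merge the two clusters whose prototypes are closest."""
    clusters = [[i] for i in range(len(pdfs))]   # member indices per cluster
    prototypes = [list(p) for p in pdfs]         # one mixture density per cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = bhattacharyya_distance(prototypes[i], prototypes[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        # Recompute the prototype from the pdfs of all members of the merged cluster.
        prototypes[i] = mixture([pdfs[k] for k in clusters[i]])
        del clusters[j]
        del prototypes[j]
    return clusters

# Four hypothetical uncertain objects: two centered near 0, two near 10.
grid = [x * 0.1 for x in range(-50, 151)]
params = [(0.0, 1.0), (0.5, 1.2), (10.0, 1.0), (9.5, 0.8)]
objects = [gaussian_pdf_on_grid(m, s, grid) for m, s in params]
result = agglomerate(objects, 2)
print(result)
```

With these toy inputs the two objects centered near 0 end up in one cluster and the two near 10 in the other. The exhaustive pairwise search at every step is kept for readability; it says nothing about the efficiency of the method the abstract reports.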
