Dynamic clustering of histogram data based on adaptive squared Wasserstein distances

Histogram-valued data are treated differently from bar-count data. A new clustering method for histogram-valued data is proposed. Two adaptive clustering strategies are proposed. A set of quality-of-partition indices is proposed. No other clustering methods exist for histogram-valued data.

This paper presents a Dynamic Clustering Algorithm for histogram data that includes an automatic variable-weighting step based on adaptive distances. The Dynamic Clustering Algorithm is a k-means-like algorithm for clustering a set of objects into a predefined number of classes. Histogram data are realizations of particular set-valued descriptors defined in the context of Symbolic Data Analysis. We propose to use the L2 Wasserstein distance for clustering histogram data, together with two novel clustering schemes based on adaptive distances. The L2 Wasserstein distance allows the variability of a set of histograms to be expressed in two components: the first related to the variability of their averages, and the second related to the variability of the histograms due to differences in size and shape. The weighting step takes into account global and local adaptive distances as well as the two components of the variability of a set of histograms. To evaluate the clustering results, we extend some classic partition quality indices to the case where the proposed adaptive distances are used in the clustering criterion function. Examples on synthetic and real-world datasets corroborate the proposed clustering procedure.
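The two-component decomposition of the squared Wasserstein distance mentioned above can be illustrated with a small numerical sketch. It is not the paper's implementation. It assumes each histogram is given as a pair of bin edges and strictly positive bin weights, with uniform density inside each bin, and it approximates the integral of the squared difference between quantile functions on a midpoint grid. The names `quantile_function`, `squared_wasserstein`, and `n_grid` are hypothetical and chosen only for this example.

```python
import numpy as np


def quantile_function(edges, weights, t):
    """Evaluate the piecewise-linear quantile function of a histogram at levels t.

    Assumes uniform density inside each bin and strictly positive bin weights.
    """
    edges = np.asarray(edges, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Cumulative distribution values at the bin edges, starting from 0.
    cdf = np.concatenate(([0.0], np.cumsum(weights) / weights.sum()))
    # Inverting a piecewise-linear CDF is linear interpolation with axes swapped.
    return np.interp(t, cdf, edges)


def squared_wasserstein(h1, h2, n_grid=1000):
    """Approximate the squared L2 Wasserstein distance between two histograms.

    Returns (total, mean_part, dispersion_part), where
    total = mean_part + dispersion_part on the discretization grid.
    """
    # Midpoint grid on (0, 1) for the quantile levels.
    t = (np.arange(n_grid) + 0.5) / n_grid
    q1 = quantile_function(*h1, t)
    q2 = quantile_function(*h2, t)
    total = np.mean((q1 - q2) ** 2)
    # Component due to the difference between the histogram means.
    mean_part = (q1.mean() - q2.mean()) ** 2
    # Component due to differences in size and shape (centered quantile functions).
    disp_part = np.mean(((q1 - q1.mean()) - (q2 - q2.mean())) ** 2)
    return total, mean_part, disp_part


if __name__ == "__main__":
    # Two illustrative histograms: (bin edges, bin weights).
    a = ([0.0, 1.0, 2.0], [0.5, 0.5])
    b = ([3.0, 4.0, 6.0], [0.25, 0.75])
    total, mean_part, disp_part = squared_wasserstein(a, b)
    print(total, mean_part, disp_part)
```

On the grid, the total equals the sum of the two parts up to floating-point error, which mirrors the split between variability of the averages and variability of size and shape. An adaptive scheme in the spirit of the abstract could then weight the two parts separately, although the weighting rule itself is not sketched here.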
