Fuzzy clustering of distributional data with automatic weighting of variable components

Distributional data, expressed as realizations of distributional variables, are a new type of data arising from several sources. In this paper, we present new fuzzy c-means algorithms for data described by distributional variables. The algorithms use the L2 Wasserstein distance between distributions as the dissimilarity measure. Usually, fuzzy c-means treats all variables as equally important for the clustering task. However, some variables may be more or less important, or even irrelevant, for this task. Using a decomposition of the squared L2 Wasserstein distance together with the notion of adaptive distance, we propose algorithms that automatically compute relevance weights associated with variables, as well as with their components. The weights are computed either for the whole dataset or cluster-wise. Relevance weights express the importance of each variable, or of each component, in the clustering process, and thus also act as a variable selection method. Using artificial and real-world data, we observed that algorithms with automatic weighting of variables (or components) are better able to take into account the cluster structure of the data.
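As a rough illustration of the decomposition idea, the sketch below assumes one-dimensional distributions represented by their quantile functions sampled on a regular grid, and splits the squared L2 Wasserstein distance into a position (mean) part and a variability part. The abstract does not state which decomposition or weighting constraint is used, so the two-component split, the function names (quantile_grid, w2_squared_components, adaptive_distance) and the weights w_mean and w_disp are assumptions made for illustration only, not the authors' implementation.

```python
from statistics import fmean


def quantile_grid(sample, m=100):
    """Empirical quantile function of `sample` at the m midpoints of (0, 1)."""
    xs = sorted(sample)
    n = len(xs)
    return [xs[min(int((k + 0.5) / m * n), n - 1)] for k in range(m)]


def w2_squared(qf, qg):
    """Squared L2 Wasserstein distance approximated on a shared quantile grid."""
    return fmean([(a - b) ** 2 for a, b in zip(qf, qg)])


def w2_squared_components(qf, qg):
    """Split the squared distance into (mean part, variability part).

    The mean part compares the two averages; the variability part compares
    the centered quantile functions. Their sum equals w2_squared(qf, qg).
    """
    mf, mg = fmean(qf), fmean(qg)
    mean_part = (mf - mg) ** 2
    disp_part = fmean([((a - mf) - (b - mg)) ** 2 for a, b in zip(qf, qg)])
    return mean_part, disp_part


def adaptive_distance(qf, qg, w_mean=1.0, w_disp=1.0):
    """Weighted sum of the two components (weights here are hypothetical)."""
    mean_part, disp_part = w2_squared_components(qf, qg)
    return w_mean * mean_part + w_disp * disp_part


if __name__ == "__main__":
    qf = quantile_grid([0, 1, 2, 3], m=4)
    qg = quantile_grid([10, 12, 14, 16], m=4)
    print(w2_squared_components(qf, qg), w2_squared(qf, qg))
```

With both weights equal to 1, adaptive_distance reduces to the plain squared distance; weights that differ between the components (or between variables or clusters) let a clustering procedure emphasize whichever part better separates the groups.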
