On the use of Wasserstein metric in topological clustering of distributional data

This paper presents a clustering algorithm for histogram data based on Self-Organizing Map (SOM) learning. It combines dimension reduction by SOM with clustering of the data in the reduced space. Because the data are distributions, a suitable dissimilarity measure between distributions is used: the L2 Wasserstein distance. The number of clusters is not fixed in advance; it is determined automatically from a local estimate of the data density in the original space. Applications to synthetic and real data sets corroborate the proposed strategy.
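For one-dimensional distributions, the squared L2 Wasserstein distance is the integral over t in (0, 1) of the squared difference between the two quantile functions. The sketch below illustrates that definition for two histograms. It is not the authors' implementation. The function names (`make_quantile_function`, `wasserstein_l2`), the assumption that mass is uniform within each bin, and the midpoint-rule approximation with `n_levels` probability levels are all choices made for this illustration.

```python
from bisect import bisect_right
from math import sqrt


def make_quantile_function(edges, weights):
    """Return q(p), the quantile function of a histogram.

    Assumption for this sketch: mass is spread uniformly inside each bin.
    `edges` has len(weights) + 1 increasing values; `weights` are non-negative.
    """
    total = float(sum(weights))
    cum = [0.0]
    for w in weights:
        cum.append(cum[-1] + w / total)  # cumulative probability at each edge

    def quantile(p):
        # Last bin index i with cum[i] <= p, clamped to a valid bin.
        i = min(bisect_right(cum, p) - 1, len(weights) - 1)
        mass = cum[i + 1] - cum[i]
        if mass == 0.0:
            return edges[i]
        # Linear interpolation inside the bin (uniform-within-bin assumption).
        frac = (p - cum[i]) / mass
        return edges[i] + frac * (edges[i + 1] - edges[i])

    return quantile


def wasserstein_l2(edges_a, weights_a, edges_b, weights_b, n_levels=1000):
    """Approximate the L2 Wasserstein distance between two histograms.

    Uses a midpoint rule over n_levels probability levels in (0, 1).
    """
    qa = make_quantile_function(edges_a, weights_a)
    qb = make_quantile_function(edges_b, weights_b)
    levels = [(k + 0.5) / n_levels for k in range(n_levels)]
    mean_sq = sum((qa(p) - qb(p)) ** 2 for p in levels) / n_levels
    return sqrt(mean_sq)


if __name__ == "__main__":
    # Two histograms with the same shape, the second shifted right by 2.
    print(wasserstein_l2([0, 1, 2], [1, 3], [2, 3, 4], [1, 3]))
```

When two histograms differ only by a shift, their quantile functions differ by that constant, so the distance equals the size of the shift. In a SOM setting, a function of this kind would play the role that the Euclidean distance plays for ordinary vector data.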
