Comparing Histogram Data Using a Mahalanobis–Wasserstein Distance

In this paper, we present a new distance for comparing data described by histograms. The distance is a generalization of the classical Mahalanobis distance for data described by correlated variables. We define a way to extend the classical concepts of inertia and codeviance from a set of points to a set of data described by histograms. The same results are also presented for data described by continuous density functions (empirical or estimated). An application to real data, using dynamic clustering, illustrates the effects of the new distance.
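As a rough illustration of the Wasserstein component named in the title, the following sketch computes a squared L2 Wasserstein distance between two one-dimensional histograms through their quantile functions. The abstract does not state the formula, so the choice of the L2 form, the assumption of uniform density inside each bin, the requirement that all bin weights be positive, and the function names `histogram_quantile` and `wasserstein2_squared` are all assumptions made here for illustration. The Mahalanobis-type extension to several correlated histogram variables is not covered.

```python
import numpy as np


def histogram_quantile(edges, weights, t):
    """Quantile function of a histogram, assuming uniform density within each bin.

    edges:   bin edges, strictly increasing, length k + 1
    weights: positive bin weights (counts or frequencies), length k
    t:       probability levels in [0, 1]
    """
    edges = np.asarray(edges, dtype=float)
    weights = np.asarray(weights, dtype=float)
    cdf = np.concatenate(([0.0], np.cumsum(weights / weights.sum())))
    cdf[-1] = 1.0  # guard against floating-point round-off in the cumulative sum
    # The CDF is piecewise linear, so its inverse is a linear interpolation.
    return np.interp(t, cdf, edges)


def wasserstein2_squared(edges_a, weights_a, edges_b, weights_b, n_grid=10_000):
    """Approximate the integral over (0, 1) of the squared quantile difference.

    Uses a midpoint rule with n_grid points (an illustrative choice).
    """
    t = (np.arange(n_grid) + 0.5) / n_grid
    qa = histogram_quantile(edges_a, weights_a, t)
    qb = histogram_quantile(edges_b, weights_b, t)
    return float(np.mean((qa - qb) ** 2))


if __name__ == "__main__":
    # Two made-up histograms with different bins and weights.
    d2 = wasserstein2_squared([0, 1, 2, 3], [1, 2, 1], [1, 2, 4], [3, 1])
    print(d2)
```

Working with quantile functions keeps the computation independent of how the two histograms are binned, which is why the two inputs above do not need to share bin edges.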
