Dissimilarity measures for histogram-valued data and divisive clustering of symbolic objects

Contemporary datasets are growing larger and more complex, while existing techniques for analysing them are increasingly inadequate, so new methods are needed to handle these new types of data. This study introduces methods to cluster histogram-valued data. Histogram-valued data are difficult to handle computationally because observations typically differ in the number and length of their subintervals. A transformation for histogram data is therefore proposed to make them easier to handle computationally. Building on this transformation, three new dissimilarity measures for histogram data are proposed. The monothetic clustering algorithm based on Chavent (1998, 2000) is then extended to histogram data, and a polythetic clustering algorithm for symbolic objects, based on all p variables, is developed. Validity criteria to aid in selecting the optimal number of clusters are described and verified by simulation studies. The new methodology is illustrated on a large dataset collected from the US Forestry Service.

Index words: Symbolic data, Histogram-valued data, Dissimilarity measure, Monothetic algorithm, Polythetic algorithm, Validity
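The abstract does not spell out the proposed transformation, so the following is only a minimal sketch of one common way to put histograms with different subintervals on a shared footing: rebin each histogram onto the union of all bin edges, assuming values are uniformly distributed within each original subinterval. The function names (`common_edges`, `rebin`, `euclidean`) are hypothetical, and the Euclidean distance is a stand-in for illustration, not one of the three dissimilarity measures proposed in the study.

```python
from typing import List, Sequence, Tuple

# A histogram is (bin edges, relative frequencies); len(edges) == len(freqs) + 1.
Histogram = Tuple[List[float], List[float]]


def common_edges(histograms: Sequence[Histogram]) -> List[float]:
    """Return the sorted union of the bin edges of all histograms."""
    return sorted({e for edges, _ in histograms for e in edges})


def rebin(hist: Histogram, new_edges: Sequence[float]) -> List[float]:
    """Redistribute relative frequencies onto new_edges.

    Assumption (not from the source): mass is spread uniformly
    within each original subinterval.
    """
    edges, freqs = hist
    out = []
    for lo, hi in zip(new_edges[:-1], new_edges[1:]):
        mass = 0.0
        for a, b, p in zip(edges[:-1], edges[1:], freqs):
            overlap = min(hi, b) - max(lo, a)
            if overlap > 0:
                mass += p * overlap / (b - a)
        out.append(mass)
    return out


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """Illustrative stand-in dissimilarity between two rebinned histograms."""
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5


# Two observations whose histograms use different subintervals.
h1: Histogram = ([0.0, 2.0, 4.0], [0.5, 0.5])
h2: Histogram = ([0.0, 1.0, 4.0], [0.4, 0.6])

edges = common_edges([h1, h2])
p1 = rebin(h1, edges)
p2 = rebin(h2, edges)
d = euclidean(p1, p2)
```

After rebinning, every observation is a frequency vector over the same subintervals, so any vector-based dissimilarity can be applied; which measure is appropriate is a separate modelling choice.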

[1] Ali S. Hadi, et al. Finding Groups in Data: An Introduction to Cluster Analysis, 1991.

[2] Mia Hubert, et al. Clustering in an object-oriented environment, 1997.

[3] Edwin Diday, et al. Probabilist, possibilist and belief objects for knowledge analysis, 1995, Ann. Oper. Res.

[4] Theodore Johnson, et al. Squashing flat files flatter, 1999, KDD '99.

[5] V. V. Strelkov, et al. A new similarity measure for histogram comparison and its application in time series analysis, 2008, Pattern Recognit. Lett.

[6] W. T. Williams, et al. Dissimilarity Analysis: a new Technique of Hierarchical Sub-division, 1964, Nature.

[7] Victor L. Brailovsky, et al. Probabilistic validation approach for clustering, 1995, Pattern Recognit. Lett.

[8] Francisco de A. T. de Carvalho, et al. Proximity Coefficients between Boolean symbolic objects, 1994.

[9] H. P. Friedman, et al. On Some Invariant Criteria for Grouping Data, 1967.

[10] K. Chidananda Gowda, et al. Symbolic clustering using a new similarity measure, 1992, IEEE Trans. Syst. Man Cybern.

[11] E. Diday, et al. Une généralisation des arbres hiérarchiques: les représentations pyramidales, 1990.

[12] Donato Malerba, et al. Comparing Dissimilarity Measures for Symbolic Data Analysis, 2001.

[13] Javier Arroyo Gallardo, et al. Forecasting histogram time series with k-nearest neighbours methods, 2009.

[14] Herbert A. Sturges, et al. The Choice of a Class Interval, 1926.

[15] Alfredo Rizzi, et al. Metrics in Symbolic Data Analysis, 2005.

[16] Hans-Hermann Bock, et al. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data, 2000.

[17] Donald W. Bouldin, et al. A Cluster Separation Measure, 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18] Manabu Ichino, et al. Generalized Minkowski metrics for mixed feature-type data analysis, 1994, IEEE Trans. Syst. Man Cybern.

[19] Enrique H. Ruspini, et al. Numerical methods for fuzzy clustering, 1970, Inf. Sci.

[20] Sung-Hyuk Cha, et al. On measuring the distance between histograms, 2002, Pattern Recognit.

[21] G. W. Milligan, et al. An examination of procedures for determining the number of clusters in a data set, 1985.

[22] M. Ichino. General Metrics For Mixed Features The Cartesian Space Theory For Pattern Recognition, 1988, Proceedings of the 1988 IEEE International Conference on Systems, Man, and Cybernetics.

[23] K. Chidananda Gowda, et al. Divisive clustering of symbolic objects using the concepts of both similarity and dissimilarity, 1995, Pattern Recognit.

[24] Edwin Diday, et al. Unsupervised learning through symbolic clustering, 1991, Pattern Recognit. Lett.

[25] Art B. Owen, et al. Data Squashing by Empirical Likelihood, 2004, Data Mining and Knowledge Discovery.

[26] G. N. Lance, et al. Note on a New Information-Statistic Classificatory Program, 1968, Comput. J.

[27] Edwin Diday, et al. Symbolic clustering using a new dissimilarity measure, 1991, Pattern Recognit.

[28] J. Dunn. Well-Separated Clusters and Optimal Fuzzy Partitions, 1974.

[29] A. W. Edwards, et al. A Method for Cluster Analysis, 1965, Biometrics.

[30] Edwin Diday, et al. Symbolic Data Analysis: Conceptual Statistics and Data Mining (Wiley Series in Computational Statistics), 2007.

[31] Edwin Diday. Introduction à l'approche symbolique en analyse des données, 1989.

[32] Edwin Diday. From Data to Knowledge: Probabilist Objects for a Symbolic Data Analysis, 1993, Partitioning Data Sets.

[33] J. MacQueen. Some methods for classification and analysis of multivariate observations, 1967.

[34] L. Billard, et al. From the Statistics of Data to the Statistics of Knowledge, 2003.

[35] Einoshin Suzuki, et al. Data Squashing for Speeding Up Boosting-Based Outlier Detection, 2002, ISMIS.

[36] G. N. Lance, et al. Computer Programs for Hierarchical Polythetic Classification ("Similarity Analyses"), 1966, Comput. J.

[37] Michael J. A. Berry, et al. Data mining techniques - for marketing, sales, and customer support, 1997, Wiley computer publishing.

[38] Marie Chavent, et al. A monothetic clustering method, 1998, Pattern Recognit. Lett.

[39] Christian Posse, et al. Likelihood-Based Data Squashing: A Modeling Approach to Instance Construction, 2002, Data Mining and Knowledge Discovery.

[40] Antonio Irpino, et al. A New Wasserstein Based Distance for the Hierarchical Clustering of Histogram Symbolic Data, 2006, Data Science and Classification.

[41] Mohamed A. Ismail, et al. Multidimensional data clustering utilizing hybrid search strategies, 1989, Pattern Recognit.

[42] J. H. Ward. Hierarchical Grouping to Optimize an Objective Function, 1963.

[43] Paula Brito. Symbolic objects: order structure and pyramidal clustering, 1995, Ann. Oper. Res.

[44] Minho Kim, et al. New indices for cluster validity assessment, 2005, Pattern Recognit. Lett.

[45] R. Fisher. The Use of Multiple Measurements in Taxonomic Problems, 1936.