A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data

Similarity or distance measures are core components of distance-based clustering algorithms, which group similar data points into the same cluster and place dissimilar or distant data points into different clusters. The performance of similarity measures has mostly been studied in two- or three-dimensional spaces; to the best of our knowledge, no empirical study has revealed how these measures behave on high-dimensional datasets. To fill this gap, this study proposes a technical framework to analyze, compare and benchmark the influence of different similarity measures on the results of distance-based clustering algorithms. For reproducibility, fifteen publicly available datasets were used, so that future distance measures can be evaluated and compared against the results of the measures discussed in this work. These datasets were divided into low- and high-dimensional categories to study the performance of each measure on each category. This research should help the research community identify suitable distance measures for a given dataset and should facilitate the comparison and evaluation of newly proposed similarity or distance measures against traditional ones.
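
As an illustration of why the choice of measure matters, the following minimal sketch assigns a point to its nearest cluster centre under several common distance measures. The four measures, the function names and the toy coordinates are assumptions chosen for illustration; they are not taken from the framework or the datasets of this study.

```python
import math

# Four common distance measures for continuous data (illustrative selection).

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

MEASURES = {
    "euclidean": euclidean,
    "manhattan": manhattan,
    "chebyshev": chebyshev,
    "cosine": cosine_distance,
}

def assign_to_nearest(points, centres, measure):
    """Return, for each point, the index of the closest centre under `measure`."""
    return [
        min(range(len(centres)), key=lambda k: measure(p, centres[k]))
        for p in points
    ]

if __name__ == "__main__":
    # Hypothetical toy data: two cluster centres and one point to assign.
    centres = [(1.0, 1.0), (10.0, 0.0)]
    points = [(7.0, 5.0)]
    for name, measure in MEASURES.items():
        print(name, assign_to_nearest(points, centres, measure))
```

On this toy input the magnitude-based measures (Euclidean, Manhattan, Chebyshev) assign the point to the second centre, while cosine distance, which depends only on direction, assigns it to the first. The assignment step of a distance-based clustering algorithm is therefore sensitive to the measure even in two dimensions.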
