Analysis of Different Similarity Measure Functions and their Impacts on Shared Nearest Neighbor Clustering Approach

Clustering is a technique for grouping data items with similar content. In recent years, density-based clustering algorithms, especially the Shared Nearest Neighbor (SNN) approach, have gained considerable popularity in the field of data mining. SNN finds clusters of different sizes, densities, and shapes, even in the presence of a large amount of noise and outliers, and it is widely used where large, multidimensional, and dynamic databases are maintained. A typical clustering technique relies on a similarity function to compare data items. Many similarity functions, such as the Euclidean and Jaccard measures, have previously been studied for this purpose. In this paper, we evaluate the impact of four similarity measures (Euclidean, cosine, Jaccard, and correlation distance) on the SNN clustering approach and compare the results. Based on our analysis, we conclude that the Euclidean function works best with SNN clustering, compared with the cosine, Jaccard, and correlation distance measures.
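
To illustrate how the choice of similarity measure enters SNN, the following is a minimal sketch rather than the implementation evaluated in this paper. It assumes the common formulation in which the SNN similarity of two points is the number of neighbors they share in their k-nearest-neighbor lists, counted only when each point appears in the other's list. The toy data, the value of k, the helper names, the use of the generalized (min/max) form of the Jaccard distance, and the fallback value for zero-norm vectors are all assumptions made for illustration. The density estimation and core-point steps of full SNN clustering are omitted.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0  # assumption: treat a zero-norm vector as maximally dissimilar
    return 1.0 - dot / (na * nb)

def jaccard(a, b):
    # Assumption: generalized Jaccard distance for non-negative vectors.
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return 1.0 - num / den if den else 0.0

def correlation(a, b):
    # Correlation distance is the cosine distance of mean-centered vectors.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - ma for x in a], [y - mb for y in b])

def knn_lists(points, dist, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    n = len(points)
    lists = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: dist(points[i], points[j]))
        lists.append(set(others[:k]))
    return lists

def snn_similarity(points, dist, k):
    """Shared-neighbor counts for pairs that are in each other's k-NN lists."""
    nn = knn_lists(points, dist, k)
    n = len(points)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if j in nn[i] and i in nn[j]:
                sim[i][j] = sim[j][i] = len(nn[i] & nn[j])
    return sim

# Hypothetical toy data: two well-separated groups of three points each.
points = [(1, 1, 2), (1, 2, 2), (2, 1, 2), (8, 9, 1), (9, 8, 1), (9, 9, 2)]
measures = {"euclidean": euclidean, "cosine": cosine,
            "jaccard": jaccard, "correlation": correlation}
results = {name: snn_similarity(points, fn, k=2) for name, fn in measures.items()}

for name, sim in results.items():
    print(name)
    for row in sim:
        print("  ", row)
```

Because only the `dist` argument changes between runs, the resulting SNN similarity matrices can be compared directly to see how each measure reshapes the neighbor lists that SNN builds on.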
