Cluster Validating Techniques in the Presence of Duplicates

Detecting database records that are exact or approximate duplicates, whether caused by data entry errors, by differences in the detailed schemas of records drawn from multiple databases, or by other factors, is an important line of research. Yet no comprehensive comparative study has evaluated the effectiveness of the Silhouette width, the Calinski & Harabasz index (pseudo F-statistic) and the Baker & Hubert index (γ index) in the presence of exact and approximate duplicates. This chapter presents a comparative study of the effectiveness of these three cluster validation techniques, which measure the stability of a partition of a data set in the presence of noise, in particular approximate and exact duplicates. The Silhouette width, Calinski & Harabasz index and Baker & Hubert index are calculated before and after exact and approximate duplicates are deliberately inserted into the data set. Comprehensive experiments on the glass, wine, iris and Ruspini data sets confirm that the Baker & Hubert index is not stable in the presence of approximate duplicates. Moreover, in the presence of approximate duplicates, the Silhouette width, Calinski & Harabasz index and Baker & Hubert index do not exceed the values obtained on the original data.
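The before-and-after comparison described above can be illustrated with a minimal sketch for one of the three indices, the Silhouette width. Everything in it is an assumption made for illustration and is not taken from the chapter: the toy two-cluster data, the helper names `silhouette_width` and `add_duplicates`, the modelling of approximate duplicates as copies with small Gaussian jitter, and the choice to let each duplicate inherit the cluster label of its source record rather than re-clustering the data.

```python
import math
import random


def silhouette_width(points, labels):
    """Average silhouette width of a labelled partition (needs >= 2 clusters)."""
    n = len(points)
    clusters = sorted(set(labels))
    total = 0.0
    for i in range(n):
        # Accumulate distances from point i to the members of every cluster.
        sums = {c: 0.0 for c in clusters}
        counts = {c: 0 for c in clusters}
        for j in range(n):
            if i != j:
                sums[labels[j]] += math.dist(points[i], points[j])
                counts[labels[j]] += 1
        own = labels[i]
        if counts[own] == 0:
            continue  # singleton cluster: s(i) = 0 by convention
        a = sums[own] / counts[own]  # mean distance within own cluster
        b = min(sums[c] / counts[c] for c in clusters if c != own)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / n


def add_duplicates(points, labels, k, noise, rng):
    """Append k copies of randomly chosen records (hypothetical helper).

    noise == 0 gives exact duplicates; noise > 0 adds Gaussian jitter,
    which is one assumed model of an approximate duplicate.
    Each copy inherits the label of its source record (an assumption).
    """
    new_points, new_labels = list(points), list(labels)
    for idx in rng.sample(range(len(points)), k):
        p = points[idx]
        if noise > 0:
            p = tuple(x + rng.gauss(0.0, noise) for x in p)
        new_points.append(p)
        new_labels.append(labels[idx])
    return new_points, new_labels


# Toy data: two well-separated clusters (illustrative only).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

rng = random.Random(0)
s_before = silhouette_width(points, labels)
exact_pts, exact_lbl = add_duplicates(points, labels, k=3, noise=0.0, rng=rng)
approx_pts, approx_lbl = add_duplicates(points, labels, k=3, noise=0.3, rng=rng)
s_exact = silhouette_width(exact_pts, exact_lbl)
s_approx = silhouette_width(approx_pts, approx_lbl)

print("original:             ", round(s_before, 4))
print("exact duplicates:     ", round(s_exact, 4))
print("approximate duplicates:", round(s_approx, 4))
```

Comparing the three printed values shows how far the index moves once duplicates are present. The same harness could wrap the other two indices, and a study along the lines described here would presumably re-run the clustering after insertion rather than reuse the original labels.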
