On similarity measures for cluster analysis in clinical laboratory examination databases

This paper examines how a conventional similarity measure performs on a practical medical data set. The similarity measure used was a linear combination of the Mahalanobis distance between numerical attributes and the Hamming distance between nominal attributes. We performed clustering experiments on the meningoencephalitis data set using this similarity measure in conjunction with four clustering algorithms: single-linkage and complete-linkage agglomerative hierarchical clustering, Ward's method, and rough clustering. The usefulness of the similarity measure was evaluated from two viewpoints: (1) the quality of the generated clusters; and (2) the clinical reasonability of the attributes used to generate the high-quality clusters. The results show that the best clusters were obtained using Ward's method, for which clinically reasonable attributes were selected. This suggests that the similarity measure would be applicable to medical data sets.
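
As an illustration of the kind of measure described above, the following is a minimal sketch of a linear combination of the Mahalanobis distance over numerical attributes and the Hamming distance over nominal attributes. It is not the authors' implementation. The function names, the equal weights `w_num` and `w_nom`, and the choice to pass a precomputed inverse covariance matrix `inv_cov` are all assumptions made for this example; the abstract does not state how the two terms were weighted or normalized.

```python
import math


def mahalanobis(x, y, inv_cov):
    """Mahalanobis distance between numeric vectors x and y.

    inv_cov is the inverse covariance matrix of the numerical attributes,
    given as a list of rows (assumed to be estimated elsewhere).
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    # Quadratic form d^T * inv_cov * d
    q = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(q)


def hamming(x, y):
    """Hamming distance: number of nominal attributes whose values differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def mixed_dissimilarity(x_num, y_num, x_nom, y_nom, inv_cov,
                        w_num=0.5, w_nom=0.5):
    """Weighted sum of the two distances (weights are hypothetical)."""
    return (w_num * mahalanobis(x_num, y_num, inv_cov)
            + w_nom * hamming(x_nom, y_nom))
```

With an identity inverse covariance matrix the Mahalanobis term reduces to the Euclidean distance, which makes the function easy to check by hand. A pairwise matrix of such values could then be passed to a hierarchical clustering routine.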