Investigating diversity of clustering methods: An empirical comparison

This paper aims to shed light on why clustering algorithms, despite being quantitative and hence supposedly objective, yield different and varied results. To that end, we applied 10 common clustering algorithms to four well-known datasets that are used in the literature as baselines with agreed-upon clusters. One additional method, Binary-Positive, developed by our team, was added to the analysis. The results affirm the unpredictable nature of the clustering process and point to the different assumptions made by different methods. One conclusion of the study is that the clustering method must be chosen carefully for any given application.
