Comparing relational and non-relational algorithms for clustering propositional data

Cluster detection methods are widely studied in propositional data mining. In this setting, each example is represented individually as a feature vector. Such data has a naturally non-relational structure, but it can be given a relational form through similarity-based network models, in which examples are represented by vertices and an edge connects two examples with high similarity. This relational representation makes it possible to apply network-based algorithms from relational data mining. For clustering tasks specifically, these models allow community detection algorithms for networks to be used to detect data clusters. In this work, we compare traditional clustering algorithms based on non-relational data with cluster detection algorithms based on relational data, using measures for community detection in networks. We carried out an exploratory analysis over 23 numerical datasets and 10 textual datasets. The results show that network models can efficiently represent the data topology, allowing their application in cluster detection with higher precision than non-relational methods.
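The general idea of turning feature vectors into a network and then clustering by community detection can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from this work: the network model is a plain k-nearest-neighbour graph under Euclidean distance, the community detection step is a simple deterministic label propagation, and the function names `knn_graph` and `label_propagation` as well as the toy points are hypothetical.

```python
import math
from collections import Counter


def knn_graph(points, k):
    """Build an undirected k-nearest-neighbour graph as an adjacency dict.

    Assumption: "high similarity" is modelled as small Euclidean distance.
    """
    adj = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        # Rank all other points by their distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        # Connect p to its k closest points; edges are undirected.
        for j in others[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def label_propagation(adj, max_rounds=100):
    """Illustrative community detection: each vertex repeatedly adopts the
    most frequent label among its neighbours until no label changes."""
    labels = {v: v for v in adj}  # every vertex starts in its own community
    for _ in range(max_rounds):
        changed = False
        for v in sorted(adj):
            if not adj[v]:
                continue  # isolated vertices keep their own label
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            # Break ties deterministically by taking the smallest label.
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[v]:
                labels[v] = new_label
                changed = True
        if not changed:
            break
    return labels


# Toy data (hypothetical): two well-separated groups of 2-D feature vectors.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
adjacency = knn_graph(points, k=2)
communities = label_propagation(adjacency)
print(communities)
```

On this toy input the four points near the origin end up sharing one label and the four points near (10, 10) share another, so the communities found in the network coincide with the two data clusters. Other similarity measures, network models, or community detection algorithms could be substituted without changing the overall pipeline.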
