Clustering Relational Data Based on Randomized Propositionalization

Clustering of relational data has so far received far less attention than classification of such data. In this paper we investigate a simple approach based on randomized propositionalization, which allows standard clustering algorithms such as KMeans to be applied to multi-relational data. We describe how random rules are generated and then turned into boolean-valued features. Clustering is generally not straightforward to evaluate, but preliminary experiments on a number of standard ILP datasets are promising: clusters generated without class information usually agree well with the true class labels of their members, i.e., class distributions inside clusters generally differ significantly from the global class distribution. The two-tiered algorithm scales well, owing to the randomized nature of the first step and the availability of efficient propositional clustering algorithms for the second step.
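The two-tiered idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the method as implemented in the paper: the toy molecule representation, the tiny rule language (existential tests on a single atom), the helper names `random_rule`, `covers`, `propositionalize` and `kmeans`, and the plain Lloyd-style k-means written with the standard library only.

```python
import random

def random_rule(examples, rng):
    """Draw a random rule of the form 'some atom has element E and/or charge >= C'.
    The constants are sampled from a randomly chosen atom (hypothetical rule language)."""
    element, charge = rng.choice(rng.choice(examples)["atoms"])
    tests = []
    if rng.random() < 0.5:
        tests.append(("element", element))
    if rng.random() < 0.5 or not tests:  # make sure the rule has at least one test
        tests.append(("min_charge", charge))
    return tests

def covers(rule, example):
    """True if at least one atom of the example satisfies every test of the rule."""
    def atom_ok(atom):
        element, charge = atom
        return all(element == v if kind == "element" else charge >= v
                   for kind, v in rule)
    return any(atom_ok(atom) for atom in example["atoms"])

def propositionalize(examples, n_rules, seed=0):
    """Step 1: turn each example into a 0/1 vector, one column per random rule."""
    rng = random.Random(seed)
    rules = [random_rule(examples, rng) for _ in range(n_rules)]
    table = [[1.0 if covers(r, ex) else 0.0 for r in rules] for ex in examples]
    return rules, table

def kmeans(rows, k, iters=20, seed=0):
    """Step 2: plain Lloyd-style k-means on the propositional table."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign every row to its nearest center (squared Euclidean distance).
        labels = [min(range(k),
                      key=lambda c: sum((x - y) ** 2
                                        for x, y in zip(row, centers[c])))
                  for row in rows]
        # Move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy multi-relational data: each example owns a set of (element, charge) atoms.
examples = [
    {"atoms": [("c", 0.1), ("o", -0.4)]},
    {"atoms": [("c", 0.2), ("o", -0.3), ("h", 0.1)]},
    {"atoms": [("n", 0.5), ("cl", -0.6)]},
    {"atoms": [("n", 0.4), ("cl", -0.5), ("h", 0.1)]},
]
rules, table = propositionalize(examples, n_rules=8)
labels = kmeans(table, k=2)
```

Because the rules are drawn at random rather than searched for, the cost of the first step grows only with the number of rules and examples, and any off-the-shelf propositional clustering algorithm could replace the `kmeans` helper in the second step.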
