CrossClus: user-guided multi-relational clustering

Most structured data in real-life applications are stored in relational databases containing multiple semantically linked relations. Unlike clustering in a single table, when clustering objects in relational databases there are usually a large number of features conveying very different semantic information, and using all features indiscriminately is unlikely to generate meaningful results. Because the user knows her goal of clustering, we propose a new approach called CrossClus, which performs multi-relational clustering under user’s guidance. Unlike semi-supervised clustering which requires the user to provide a training set, we minimize the user’s effort by using a very simple form of user guidance. The user is only required to select one or a small set of features that are pertinent to the clustering goal, and CrossClus searches for other pertinent features in multiple relations. Each feature is evaluated by whether it clusters objects in a similar way with the user specified features. We design efficient and accurate approaches for both feature selection and object clustering. Our comprehensive experiments demonstrate the effectiveness and scalability of CrossClus.

[1]  Philip S. Yu,et al.  Finding generalized projected clusters in high dimensional spaces , 2000, SIGMOD '00.

[2]  Philip S. Yu,et al.  CrossMine: efficient classification across multiple database relations , 2004, Proceedings. 20th International Conference on Data Engineering.

[3]  Vagelis Hristidis,et al.  DISCOVER: Keyword Search in Relational Databases , 2002, VLDB.

[4]  C. A. Murthy,et al.  Unsupervised Feature Selection Using Feature Similarity , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[5]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Chster Analysis , 1991 .

[6]  R. Mike Cameron-Jones,et al.  FOIL: A Midterm Report , 1993, ECML.

[7]  Michael I. Jordan,et al.  Distance Metric Learning with Application to Clustering with Side-Information , 2002, NIPS.

[8]  Wendy R. Fox,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1991 .

[9]  Vipin Kumar,et al.  Introduction to Data Mining , 2022, Data Mining and Machine Learning Applications.

[10]  Philip S. Yu,et al.  Fast algorithms for projected clustering , 1999, SIGMOD '99.

[11]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[12]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[13]  Vipin Kumar,et al.  Introduction to Data Mining, (First Edition) , 2005 .

[14]  Luc De Raedt,et al.  Kernels and Distances for Structured Data , 2008 .

[15]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[16]  Mathias Kirsten,et al.  Extending K-Means Clustering to First-Order Representations , 2000, ILP.

[17]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[18]  Carla E. Brodley,et al.  Feature Selection for Unsupervised Learning , 2004, J. Mach. Learn. Res..

[19]  Claire Cardie,et al.  Proceedings of the Eighteenth International Conference on Machine Learning, 2001, p. 577–584. Constrained K-means Clustering with Background Knowledge , 2022 .

[20]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[21]  Bart Demoen,et al.  Improving the Efficiency of Inductive Logic Programming Through the Use of Query Packs , 2011, J. Artif. Intell. Res..

[22]  Dan Klein,et al.  From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering , 2002, ICML.

[23]  Mathias Kirsten,et al.  Relational Distance-Based Clustering , 1998, ILP.

[24]  James Kelly,et al.  AutoClass: A Bayesian Classification System , 1993, ML.

[25]  Chris H. Q. Ding,et al.  Minimum redundancy feature selection from microarray gene expression data , 2003, Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. CSB2003.

[26]  Raymond J. Mooney,et al.  Integrating constraints and metric learning in semi-supervised clustering , 2004, ICML.

[27]  Dietrich Wettschereck,et al.  Relational Instance-Based Learning , 1996, ICML.

[28]  Philip S. Yu,et al.  CrossMine: Efficient Classification Across Multiple Database Relations , 2004, Constraint-Based Mining and Inductive Databases.

[29]  Mark A. Hall,et al.  Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning , 1999, ICML.

[30]  Philip S. Yu,et al.  Cross-relational clustering with user's guidance , 2005, KDD '05.

[31]  Isabelle Guyon,et al.  An Introduction to Variable and Feature Selection , 2003, J. Mach. Learn. Res..