Summarizing relational data using semi-supervised genetic algorithm-based clustering techniques

Problem statement: In solving a classification problem in relational data mining, traditional methods, for example, the C4.5 and its variants, usually require data transformations from datasets stored in multiple tables into a single table. Unfortunately, we may loss some information when we join tables with a high degree of one-to-many association. Therefore, data transformation becomes a tedious trial-and-error work and the classification result is often not very promising especially when the number of tables and the degree of one-to-many association are large. Approach: We proposed a genetic semi-supervised clustering technique as a means of aggregating data stored in multiple tables to facilitate the task of solving a classification problem in relational database. This algorithm is suitable for classification of datasets with a high degree of one-to-many associations. It can be used in two ways. One is user-controlled clustering, where the user may control the result of clustering by varying the compactness of the spherical cluster. The other is automatic clustering, where a non-overlap clustering strategy is applied. In this study, we use the latter method to dynamically cluster multiple instances, as a means of aggregating them and illustrate the effectiveness of this method using the semi-supervised genetic algorithm-based clustering technique. Results: It was shown in the experimental results that using the reciprocal of Davies-Bouldin Index for cluster dispersion and the reciprocal of Gini Index for cluster purity, as the fitness function in the Genetic Algorithm (GA), finds solutions with much greater accuracy. The results obtained in this study showed that automatic clustering (seeding), by optimizing the cluster dispersion or cluster purity alone using GA, provides one with good results compared to the traditional k-means clustering. However, the best result can be achieved by optimizing the combination values of both the cluster dispersion and the cluster purity, by putting more weight on the cluster purity measurement. Conclusion: This study showed that semi-supervised genetic algorithm-based clustering techniques can be applied to summarize relational data with more effectively and efficiently.

[1]  Kilian Stoffel,et al.  Theoretical Comparison between the Gini Index and Information Gain Criteria , 2004, Annals of Mathematics and Artificial Intelligence.

[2]  Hendrik Blockeel,et al.  Top-Down Induction of First Order Logical Decision Trees , 1998, AI Commun..

[3]  Ashwin Srinivasan,et al.  Pharmacophore Discovery Using the Inductive Logic Programming System PROGOL , 1998, Machine Learning.

[4]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[5]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[6]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[7]  Bidyut Baran Chaudhuri,et al.  A novel genetic algorithm for automatic clustering , 2004, Pattern Recognit. Lett..

[8]  Rayner Alfred,et al.  A Genetic-Based Feature Construction Method for Data Summarisation , 2008, ADMA.

[9]  Michèle Sebag,et al.  Scalability and efficiency in multi-relational data mining , 2003, SKDD.

[10]  Donald W. Bouldin,et al.  A Cluster Separation Measure , 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Ashwin Srinivasan,et al.  Theories for Mutagenicity: A Study in First-Order and Feature-Based Induction , 1996, Artif. Intell..

[12]  Rayner Alfred,et al.  Clustering Approach to Generalized Pattern Identification Based on Multi-instanced Objects with DARA , 2007, ADBIS Research Communications.

[13]  Stefan Wrobel,et al.  Transformation-Based Learning Using Multirelational Aggregation , 2001, ILP.

[14]  Mathias Kirsten,et al.  Extending K-Means Clustering to First-Order Representations , 2000, ILP.

[15]  Peter A. Flach,et al.  A first-order representation for knowledge discovery and Bayesian classification on relational data , 2007 .

[16]  David E. Goldberg,et al.  Genetic Algorithms in Search Optimization and Machine Learning , 1988 .

[17]  John H. Holland,et al.  Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence , 1992 .