eDARA: Ensembles DARA

The ever-growing amount of digital data stored in relational databases has created a need for new approaches to extracting useful information from these databases. One such approach, the DARA algorithm, transforms data stored in relational databases into a vector space representation using information retrieval theory. The DARA algorithm has been shown to produce improvements over other state-of-the-art approaches. However, DARA suffers from a major drawback when the cardinality of the attributes in the relations is very high, because the size of the vector space representation depends on the number of unique values of all attributes in the dataset. This issue can be addressed by reducing the number of features generated by the DARA transformation process, selecting only a subset of the relevant features for processing. Since relational data is transformed into a vector space representation in the form of TF-IDF, only numerical values are used to represent each record. As a result, discretizing these numerical attributes may also reduce the dimensionality of the transformed dataset. When clustering is applied to these datasets, clustering results of various dimensions may be produced as the number of bins used to discretize the numerical attributes is varied. A consensus clustering step can then combine these results into a single clustering that is a better fit, in some sense, than the individual clusterings. In this study, an ensemble DARA clustering approach is proposed that provides a mechanism for representing the consensus across multiple runs of a clustering algorithm on relational datasets.
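The consensus step can be illustrated with a minimal sketch. The abstract does not say which consensus function eDARA uses, so the code below assumes one common choice, evidence accumulation over a co-association matrix, and merges records that share a cluster in more than half of the base clusterings. The function names, the threshold of 0.5, and the three base partitions (standing in for clusterings obtained with three different bin counts) are all hypothetical.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of base clusterings that place each pair of records together."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def consensus(partitions, threshold=0.5):
    """Merge records whose co-association exceeds the threshold (assumed rule)."""
    matrix = co_association(partitions)
    n = len(matrix)
    parent = list(range(n))  # union-find forest over record indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if matrix[i][j] > threshold:
            parent[find(j)] = find(i)

    # Relabel the roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Hypothetical base clusterings of six records, one per bin setting.
base_partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [1, 1, 1, 0, 0, 0],
]
final_labels = consensus(base_partitions)
print(final_labels)
```

Cluster labels are arbitrary within each base clustering (the first and third partitions above are the same grouping with swapped labels), which is why the sketch compares pairs of records rather than the labels themselves.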
