k-Nearest Neighbour Using Ensemble Clustering Based on Feature Selection Approach to Learning Relational Data

As the amount of data generated and stored in relational databases grows, relational learning has attracted increasing research interest in recent years, and many approaches have been developed to learn from relational data. One of them is Dynamic Aggregation of Relational Attributes (DARA), an algorithm designed to summarize relational data with one-to-many relations. However, DARA suffers from a major drawback when the cardinalities of attributes are very high, because the size of its vector space representation depends on the number of unique values that exist across all attributes in the dataset. A feature selection process can be introduced to overcome this problem, and the selected features can be further optimized to achieve a good classification result. In addition, several clustering runs can be performed for different values of k to yield an ensemble of clustering results. This paper proposes a two-layered genetic algorithm-based feature selection method to improve the classification performance of learning from relational databases using a k-NN ensemble classifier. The proposed method omits less relevant features while retaining the diversity of the classifiers, so as to improve the performance of the k-NN ensemble. The results show that the proposed k-NN ensemble improves on the performance of traditional k-NN classifiers.
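To make the idea of a k-NN ensemble built on feature subsets concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes each ensemble member is a plain k-NN classifier restricted to a binary feature mask, and that members are combined by majority vote. The masks here are hand-picked for illustration; in the proposed method they would come from the genetic algorithm-based feature selection. All names (`knn_predict`, `ensemble_predict`, the toy data) are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, mask, k=3):
    """Classify `query` with plain k-NN using only features where mask[i] == 1."""
    keep = [i for i, bit in enumerate(mask) if bit]

    def project(row):
        # Drop the features that the mask switches off.
        return [row[i] for i in keep]

    q = project(query)
    # Rank training rows by Euclidean distance in the reduced feature space.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(project(pair[0]), q))
    labels = [label for _, label in ranked[:k]]
    return Counter(labels).most_common(1)[0][0]


def ensemble_predict(train_X, train_y, query, masks, k=3):
    """Majority vote over one k-NN classifier per feature mask."""
    votes = [knn_predict(train_X, train_y, query, mask, k) for mask in masks]
    return Counter(votes).most_common(1)[0][0]


# Toy data (assumed): features 0 and 1 separate the classes, 2 and 3 are noise.
train_X = [
    [0.0, 0.1, 5.0, 9.0],
    [0.2, 0.0, 1.0, 2.0],
    [0.1, 0.3, 8.0, 4.0],
    [1.0, 0.9, 5.5, 8.5],
    [0.9, 1.1, 1.2, 2.2],
    [1.2, 1.0, 7.5, 4.5],
]
train_y = ["a", "a", "a", "b", "b", "b"]
query = [0.15, 0.1, 5.4, 8.6]

# Hand-picked masks standing in for GA-selected feature subsets.
masks = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]]

single = knn_predict(train_X, train_y, query, [1, 1, 1, 1])
ensemble = ensemble_predict(train_X, train_y, query, masks)
print("single k-NN on all features:", single)
print("k-NN ensemble on feature subsets:", ensemble)
```

On this toy data the noisy features pull the single k-NN toward the wrong class, while every masked member ignores them. Using different masks per member is also one simple way to keep the classifiers diverse, which is the property the abstract says the method tries to retain.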
