An Efficient Approach for Instance Selection

The volume of data now being produced challenges our capacity to convert it into useful knowledge. Data mining approaches have therefore been applied to extract useful knowledge from this big data. To cope with the increasing size of datasets, instance selection techniques have been used to reduce the data to a manageable volume and, consequently, to reduce the computational resources needed to apply data mining approaches. However, most of the proposed instance selection approaches have a high time complexity and, for this reason, cannot be applied to big data. In this paper, we propose a novel approach for instance selection called XLDIS. This approach adopts the notion of local density to select the most representative instances of each class of the dataset, while keeping a reasonably low time complexity. The approach was evaluated on 20 well-known datasets in a classification task, and its performance was compared with that of 6 state-of-the-art algorithms on three measures: accuracy, reduction, and effectiveness. The results show that, in general, the XLDIS algorithm provides the best trade-off between accuracy and reduction.
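To make the idea of selecting representative instances by local density more concrete, the following is a minimal sketch and not the XLDIS algorithm itself, whose details are not given in this abstract. It assumes that the density of an instance is the negative mean distance to the other instances of its class, and that an instance is kept when it is at least as dense as each of its `k` nearest same-class neighbours. The function names, the parameter `k`, and the definitions of reduction (fraction of instances removed) and effectiveness (accuracy times reduction) are assumptions made for illustration. The sketch is quadratic within each class and makes no attempt to match the time complexity claimed for XLDIS.

```python
from collections import defaultdict
from math import dist


def local_density_select(X, y, k=3):
    """Return sorted indices of the instances kept (illustrative sketch)."""
    # Group instance indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    selected = []
    for idx in by_class.values():
        if len(idx) == 1:
            selected.extend(idx)  # a lone instance represents its class
            continue
        # Assumed density: negative mean distance to same-class instances.
        density = {
            i: -sum(dist(X[i], X[j]) for j in idx if j != i) / (len(idx) - 1)
            for i in idx
        }
        for i in idx:
            # k nearest neighbours of i, restricted to its own class.
            neighbours = sorted(
                (j for j in idx if j != i), key=lambda j: dist(X[i], X[j])
            )[:k]
            # Keep i if no neighbour in its class is denser than i.
            if all(density[i] >= density[j] for j in neighbours):
                selected.append(i)
    return sorted(selected)


def reduction(n_original, n_selected):
    """Assumed definition: fraction of the original instances removed."""
    return (n_original - n_selected) / n_original


def effectiveness(accuracy, red):
    """Assumed definition: product of accuracy and reduction."""
    return accuracy * red


if __name__ == "__main__":
    X = [(0, 0), (1, 0), (2, 0), (10, 10), (11, 10), (12, 10)]
    y = [0, 0, 0, 1, 1, 1]
    kept = local_density_select(X, y, k=2)
    print(kept, reduction(len(X), len(kept)))
```

On this toy dataset the middle point of each class is the densest, so only those two instances are kept. Under the assumed definitions, a selector that removes more instances scores higher on effectiveness only if classification accuracy on the reduced set does not drop proportionally.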
