Cluster-Based Instance Selection for the Imbalanced Data Classification

Instance selection, often referred to as data reduction, aims to decide which instances from the training set should be retained for further use during the learning process. Instance selection is an important preprocessing step for many machine learning tools, especially when huge data sets are considered. Class imbalance arises when the number of examples belonging to one class is much greater than the number of examples belonging to another. The paper proposes a cluster-based instance selection approach for imbalanced data classification. The proposed approach is based on a similarity coefficient between training data instances, calculated independently for each considered data class. Similar instances are grouped into clusters. Next, instance selection is carried out. The process of instance selection is controlled and carried out by a team of agents. The proposed approach is validated experimentally. The advantages and main features of the approach are discussed in light of the results of the computational experiment.
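The pipeline outlined in the abstract (per-class similarity coefficient, clustering of similar instances, selection from each cluster) might be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the abstract does not define the similarity coefficient, so the hypothetical `similarity_coefficient` below uses a weighted sum over binarized features, and the agent-based selection is replaced by a plain rule that keeps the instance nearest to each cluster centroid. All function names are invented for this sketch.

```python
from collections import defaultdict


def binarize(rows):
    """Min-max scale each feature within the given rows, then round to 0 or 1."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [
            1 if hi[j] > lo[j] and (v - lo[j]) / (hi[j] - lo[j]) >= 0.5 else 0
            for j, v in enumerate(row)
        ]
        for row in rows
    ]


def similarity_coefficient(bits, column_sums):
    """Hypothetical coefficient (an assumption): weighted sum of binarized features."""
    return sum(b * s for b, s in zip(bits, column_sums))


def select_instances(X, y):
    """Return sorted indices of retained instances, one per cluster per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    selected = []
    for indices in by_class.values():
        # The coefficient is computed for each class independently.
        bits = binarize([X[i] for i in indices])
        column_sums = [sum(c) for c in zip(*bits)]

        # Instances with an identical coefficient form one cluster.
        clusters = defaultdict(list)
        for i, b in zip(indices, bits):
            clusters[similarity_coefficient(b, column_sums)].append(i)

        # Assumption: keep the instance closest to the cluster centroid
        # (the paper delegates this step to a team of agents).
        for members in clusters.values():
            n_features = len(X[members[0]])
            centroid = [
                sum(X[i][j] for i in members) / len(members)
                for j in range(n_features)
            ]
            best = min(
                members,
                key=lambda i: sum((a - c) ** 2 for a, c in zip(X[i], centroid)),
            )
            selected.append(best)
    return sorted(selected)


# Toy imbalanced data: four majority-class instances, two minority-class ones.
X = [[0.0, 0.0], [0.1, 0.1], [0.4, 0.4], [1.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
y = [0, 0, 0, 0, 1, 1]
selected = select_instances(X, y)
print(selected)
```

Because clustering is done separately for each class, a small minority class is reduced on its own terms rather than being swamped by the majority class. In this toy data both minority instances fall into separate clusters and are retained.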
