Probability model selection and parameter evolutionary estimation for clustering imbalanced data without sampling

Data imbalance, which grows with the volume of accumulated data and is especially pronounced in big data, has become a challenging issue in recent years. In imbalanced data, the minority subsets often carry important patterns. Although several approaches exist for discovering class patterns, few of them have been applied to clustering minority patterns. Minority samples are typically submerged in big data, and without the supervision of a training set they are often ignored or assigned to majority patterns. Because clustering minorities is an uncertain process, in this paper we employ model selection and evolutionary computation to address the uncertainty and concealment of minority data in imbalanced data clustering. Given a data set, model selection chooses one model from a set of candidate models. We use probability models as candidates because they handle uncertainty effectively and are therefore well suited to imbalanced data. Because the parameters of these models are difficult to estimate, we employ an evolutionary process to adjust and estimate the optimal parameters. Experimental results show that the proposed approach can search for and discover minority patterns in imbalanced data, and that it outperforms many other relevant clustering algorithms on several performance indices.
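
The general idea of pairing model selection with evolutionary parameter estimation can be illustrated with a minimal sketch. The abstract does not specify the candidate models, the fitness function, or the evolutionary operators, so everything below is an assumption made for illustration: the candidates are one-dimensional Gaussian mixtures with 1 to 3 components, fitness is the log-likelihood, the search is a simple elitist (mu + lambda) evolution strategy with Gaussian mutation, and the Bayesian information criterion (BIC) picks among candidates. The function names (`decode`, `evolve`, `select_model`) are hypothetical and are not taken from the paper.

```python
import numpy as np

def log_likelihood(x, weights, means, stds):
    # Log-likelihood of 1-D data under a Gaussian mixture (log-sum-exp for stability).
    z = (x[:, None] - means[None, :]) / stds[None, :]
    log_pdf = -0.5 * z**2 - np.log(stds[None, :]) - 0.5 * np.log(2 * np.pi)
    a = log_pdf + np.log(weights[None, :])
    m = a.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(a - m).sum(axis=1))))

def decode(genome, k):
    # Genome layout (an assumption): k weight logits, k means, k log-stds.
    logits, means, log_stds = genome[:k], genome[k:2 * k], genome[2 * k:]
    w = np.exp(logits - logits.max())
    w = np.clip(w / w.sum(), 1e-12, None)   # keep small (minority) weights positive
    w = w / w.sum()
    stds = np.exp(np.clip(log_stds, -3.0, 3.0))  # bound stds to avoid degenerate spikes
    return w, means, stds

def evolve(x, k, rng, pop_size=40, generations=150, sigma=0.3):
    # Elitist (mu + lambda) evolution strategy maximizing the log-likelihood.
    pop = rng.normal(size=(pop_size, 3 * k))
    pop[:, k:2 * k] = rng.choice(x, size=(pop_size, k))  # start means at data points
    fitness = lambda g: log_likelihood(x, *decode(g, k))
    scores = np.array([fitness(g) for g in pop])
    history = [scores.max()]
    for _ in range(generations):
        children = pop + rng.normal(scale=sigma, size=pop.shape)  # Gaussian mutation
        child_scores = np.array([fitness(g) for g in children])
        merged = np.vstack([pop, children])
        merged_scores = np.concatenate([scores, child_scores])
        keep = np.argsort(merged_scores)[-pop_size:]  # parents and children compete
        pop, scores = merged[keep], merged_scores[keep]
        history.append(scores.max())
    return pop[np.argmax(scores)], float(scores.max()), history

def select_model(x, candidates=(1, 2, 3), seed=0):
    # Fit every candidate with the evolutionary search, then pick the lowest BIC.
    rng = np.random.default_rng(seed)
    results = {}
    for k in candidates:
        genome, loglik, _ = evolve(x, k, rng)
        bic = (3 * k - 1) * np.log(len(x)) - 2.0 * loglik  # 3k - 1 free parameters
        results[k] = (bic, decode(genome, k))
    best_k = min(results, key=lambda k: results[k][0])
    return best_k, results[best_k][1]

if __name__ == "__main__":
    # Synthetic imbalanced sample: 95% majority cluster, 5% minority cluster.
    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0.0, 1.0, 950), rng.normal(8.0, 0.5, 50)])
    k, (weights, means, stds) = select_model(x)
    print("selected components:", k)
    print("weights:", weights, "means:", means, "stds:", stds)
```

Because parents and children compete for survival, the best log-likelihood never decreases from one generation to the next. On data like the synthetic sample, one would hope that the BIC favors a mixture with a small-weight component near the minority cluster, but a short stochastic search of this kind gives no guarantee, and the settings here (population size, mutation scale, number of generations) are arbitrary illustrative choices.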
