Fuzzy-Based Information Decomposition for Incomplete and Imbalanced Data Learning

Class imbalance and missing values are two critical problems in pattern classification. Researchers have proposed a number of techniques to address each problem separately, but no single technique solves both. Moreover, simply combining a technique for each problem does not accurately classify imbalanced data with missing values. This paper develops a fuzzy-based information decomposition (FID) method that addresses the two problems simultaneously. FID treats both as the same missing data estimation problem. In particular, it rebalances the training data by creating synthetic samples for the minority class. The proposed scheme has two steps: weighting and recovery. In the weighting step, weights produced by fuzzy membership functions quantify the contribution of each observed value to the estimation of the missing values. In the recovery step, the missing values are estimated by taking into account the different contributions of the observed data. To evaluate the performance of FID, a large number of classification experiments were carried out on 27 well-known datasets. The results show that FID significantly outperforms ten other state-of-the-art individual methods and eight combination methods when missing values and imbalanced data are present at the same time.
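
The two steps can be illustrated with a minimal sketch for a single feature. The abstract does not specify the membership function or how the estimation targets are chosen, so everything here is an assumption made for illustration: the function name `fid_estimate`, the split of the observed range into `t` equal-width intervals, the triangular membership function centered on each interval, and the fallback to the observed mean when no observed value contributes to an interval.

```python
import numpy as np


def fid_estimate(observed, t):
    """Estimate t values for one feature from its observed values.

    Hypothetical sketch: equal-width intervals and a triangular
    membership function are assumptions, not taken from the paper.
    """
    x = np.asarray(observed, dtype=float)
    if t <= 0:
        return np.empty(0)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # All observed values are identical; nothing to decompose.
        return np.full(t, lo)

    # Split the observed range into t intervals (one per value to estimate).
    h = (hi - lo) / t
    centers = lo + h * (np.arange(t) + 0.5)

    # Weighting step: triangular fuzzy membership of each observed value
    # in each interval; shape is (n_observed, t).
    dist = np.abs(x[:, None] - centers[None, :])
    weights = np.clip(1.0 - dist / h, 0.0, None)

    # Recovery step: weighted average of the observed values per interval.
    totals = weights.sum(axis=0)
    weighted = (weights * x[:, None]).sum(axis=0)
    safe_totals = np.where(totals > 0, totals, 1.0)
    # Assumed fallback: use the observed mean if no value contributes.
    return np.where(totals > 0, weighted / safe_totals, x.mean())


if __name__ == "__main__":
    print(fid_estimate([0.0, 1.0, 2.0, 3.0, 4.0], t=2))
```

Under these assumptions, the same routine would serve both purposes described above: for a missing-value problem, `t` would be the number of missing entries in the feature; for rebalancing, it would be called once per feature on the minority class with `t` set to the number of synthetic samples required.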
