Missing information in imbalanced data stream: fuzzy adaptive imputation approach

From a real-world perspective, missing information is an ordinary scenario in data stream. Generally, missing data generate diverse problems in recognizing the pattern of data (i.e., clustering and classification). Particularly, missing data in data stream is a challenging topic. With imbalanced data, the problem of missing data greatly affects pattern recognition. As a solution to all these issues, this study puts forward an adaptive technique with fuzzy-based information decomposition method, which simultaneously solves the problem of incomplete data and overcomes the imbalanced data stream in a dataset. The main purpose of the proposed fuzzy adaptive imputation approach (FAIA) is to represent the effect of missing values in imbalance data stream and handle the missing data problem in imbalance data stream. FAIA is a single pass method. It considers adaptive selection of intervals based on all observed instances by using the interrelationship of attributes to identify correct interval for computing missing instances. Here, the interrelationship of two attributes means one attribute’s value of an instance depends on another attribute’s value of the same instance. In FAIA, after measuring all interval distances from a certain missing value, the least distance is selected for this missing value. Synthetic data of minority class are generated using the same process of missing value imputation for balancing data that is called oversampling. Instances of the datasets are divided into the chunks in data stream to balance data without any ensemble of previous chunks because missing values may misguide the future chunks. To demonstrate the performance of FAIA, the experiment is divided into three parts: missing data imputation, imbalanced information for offline data for data stream, and imbalanced information with missing value for offline data. Eleven numerical datasets with different dimensions from various repositories are considered for the computing performance of missing data imputation and imbalanced data without data stream. Four different datasets are also used to measure the performance of imbalanced data stream. In maximum measuring cases, the proposed method outperforms.

[1]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[2]  Haibo He,et al.  SERA: Selectively recursive approach towards nonstationary imbalanced stream data mining , 2009, 2009 International Joint Conference on Neural Networks.

[3]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[4]  Philip S. Yu,et al.  A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions , 2007, SDM.

[5]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[6]  Kay Chen Tan,et al.  Evolutionary Cluster-Based Synthetic Oversampling Ensemble (ECO-Ensemble) for Imbalance Learning , 2017, IEEE Transactions on Cybernetics.

[7]  Ke Lu,et al.  Missing data imputation by K nearest neighbours based on grey relational structure and mutual information , 2015, Applied Intelligence.

[8]  Yang Lu,et al.  Adaptive Chunk-Based Dynamic Weighted Majority for Imbalanced Data Streams With Concept Drift , 2019, IEEE Transactions on Neural Networks and Learning Systems.

[9]  Jun Zhang,et al.  Fuzzy-Based Information Decomposition for Incomplete and Imbalanced Data Learning , 2017, IEEE Transactions on Fuzzy Systems.

[10]  Witold Pedrycz,et al.  Cost-Sensitive Weighting and Imbalance-Reversed Bagging for Streaming Imbalanced and Concept Drifting in Electricity Pricing Classification , 2019, IEEE Transactions on Industrial Informatics.

[11]  Hamido Fujita,et al.  Multi-Imbalance: An open-source software for multi-class imbalance learning , 2019, Knowl. Based Syst..

[12]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[13]  Wei Liu,et al.  The Gradual Resampling Ensemble for mining imbalanced data streams with concept drift , 2018, Neurocomputing.

[14]  Xin Yao,et al.  MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning , 2014 .

[15]  Xin Yao,et al.  A Systematic Study of Online Class Imbalance Learning With Concept Drift , 2017, IEEE Transactions on Neural Networks and Learning Systems.

[16]  Haibo He,et al.  MuSeRA: Multiple Selectively Recursive Approach towards imbalanced stream data mining , 2010, The 2010 International Joint Conference on Neural Networks (IJCNN).

[17]  Michel Verleysen,et al.  K nearest neighbours with mutual information for simultaneous classification and missing data imputation , 2009, Neurocomputing.

[18]  D. Rubin,et al.  Statistical Analysis with Missing Data. , 1989 .

[19]  Shichao Zhang,et al.  The Journal of Systems and Software , 2012 .

[20]  Zili Zhang,et al.  Missing Value Estimation for Mixed-Attribute Data Sets , 2011, IEEE Transactions on Knowledge and Data Engineering.

[21]  Min Chen,et al.  Selection-based resampling ensemble algorithm for nonstationary imbalanced stream data learning , 2019, Knowl. Based Syst..

[22]  Lígia P. Brás,et al.  Improving cluster-based missing value estimation of DNA microarray data. , 2007, Biomolecular engineering.

[23]  Hui Li,et al.  Class-imbalanced dynamic financial distress prediction based on Adaboost-SVM ensemble combined with SMOTE and time weighting , 2020, Inf. Fusion.

[24]  Francisco Herrera,et al.  A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[25]  Jerzy Stefanowski,et al.  On the Dynamics of Classification Measures for Imbalanced and Streaming Data , 2020, IEEE Transactions on Neural Networks and Learning Systems.

[26]  Jinsung Yoon,et al.  Estimating Missing Data in Temporal Data Streams Using Multi-Directional Recurrent Neural Networks , 2017, IEEE Transactions on Biomedical Engineering.

[27]  J. Zupan,et al.  Self-organizing maps for imputation of missing data in incomplete data matrices , 2015 .

[28]  Huang Chong-fu,et al.  Principle of information diffusion , 1997 .

[29]  Mehmed M. Kantardzic,et al.  SOM-based partial labeling of imbalanced data stream , 2017, Neurocomputing.

[30]  Gregory Ditzler,et al.  Incremental Learning of Concept Drift from Streaming Imbalanced Data , 2013, IEEE Transactions on Knowledge and Data Engineering.