Optimization of Skewed Data Using Sampling-Based Preprocessing Approach

Classification has evolved considerably in the past few years. As the volume of data gathered from different sources keeps growing, processing and analyzing it efficiently is becoming difficult. An uneven distribution of data among classes makes classification with machine-learning techniques harder still: most algorithms focus on the majority-class samples and neglect the minority class. The data-skewing issue is therefore a critical problem that needs researchers' attention. This paper focuses on data preprocessing with sampling techniques to overcome the data-skewing problem. Three sampling techniques, Resampling, SpreadSubSampling, and SMOTE, are applied to reduce the uneven class distribution, and the resulting data are classified with the K-nearest neighbor algorithm. Classification performance is then evaluated with several performance metrics.
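To make the pipeline concrete, the following is a minimal sketch of the three kinds of sampling followed by K-nearest neighbor classification. It is not the paper's implementation. The function names (`undersample`, `oversample`, `smote`, `knn_predict`), the toy data, and the parameter values are all assumptions chosen for illustration: `oversample` stands in for resampling with replacement, `undersample` for a spread-subsample-style reduction of the majority class, and `smote` for SMOTE-style interpolation between minority-class neighbors.

```python
import random
from collections import Counter


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


def group_by_class(X, y):
    """Map each class label to the list of its feature vectors."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return groups


def undersample(X, y, rng):
    """Randomly drop samples so every class shrinks to the smallest class size."""
    groups = group_by_class(X, y)
    n_min = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        for xi in rng.sample(rows, n_min):
            X_out.append(xi)
            y_out.append(label)
    return X_out, y_out


def oversample(X, y, rng):
    """Randomly duplicate samples (with replacement) up to the largest class size."""
    groups = group_by_class(X, y)
    n_max = max(len(rows) for rows in groups.values())
    X_out, y_out = list(X), list(y)
    for label, rows in groups.items():
        for _ in range(n_max - len(rows)):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def smote(X, y, minority, n_new, k, rng):
    """SMOTE-style synthesis: interpolate between a minority sample and one of
    its k nearest minority neighbors. Assumes at least two minority samples."""
    pts = [xi for xi, yi in zip(X, y) if yi == minority]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(pts)
        others = [p for p in pts if p is not base]
        neighbors = sorted(others, key=lambda p: distance(base, p))[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return list(X) + synthetic, list(y) + [minority] * n_new


def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training samples closest to x."""
    nearest = sorted(zip(X_train, y_train), key=lambda pair: distance(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy skewed data set (illustrative only): 8 majority vs. 3 minority samples.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.4, 0.2],
     [0.3, 0.5], [0.5, 0.1], [0.2, 0.4], [0.6, 0.3],
     [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
y = ["major"] * 8 + ["minor"] * 3

rng = random.Random(0)
X_bal, y_bal = smote(X, y, "minor", n_new=5, k=2, rng=rng)
print(Counter(y_bal))
print(knn_predict(X_bal, y_bal, [4.9, 5.0], k=3))
```

Each sampler returns a rebalanced copy of the data, so the same `knn_predict` call can be run on the output of any of the three and the results compared with whichever metrics are of interest.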
