A Systematic Review on Imbalanced Data Challenges in Machine Learning

In machine learning, data imbalance poses challenges for data analytics in almost all areas of real-world research. Raw primary data often has a skewed class distribution, with one class heavily outnumbering the other, as in computer vision, information security, marketing, and medical science. This article presents a comparative analysis of contemporary imbalanced-data analysis techniques, grouped into data pre-processing, algorithmic, and hybrid paradigms, and compares them across different data distributions and application areas.
