A hybrid approach for classification of rare class data

Learning from rare class data is a challenging problem in classification. Rare class, or imbalanced class, learning is a common problem in many real-world applications, and it has therefore attracted considerable research attention. Classifiers trained on rare class data often produce misleading results because accuracy on the majority class overwhelms accuracy on the minority class. Many methods have been proposed to handle the imbalanced, rare, or skewed class problem. This paper proposes a hybrid method, i.e., one that combines classification and clustering, to address the rare class problem. The proposed hybrid method uses k-means, ensemble, and divide-and-merge methods, and it aims to improve the detection rate of every class. The proposed method is tested on real datasets. The experimental results show that it performs well compared with other algorithms.
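To illustrate why overall accuracy can mislead on rare class data, and why a per-class detection rate is the more informative measure, the following is a minimal sketch in Python. It is not the proposed hybrid method. The toy labels, the always-majority classifier, and the helper names `accuracy` and `per_class_detection_rate` are assumptions introduced only for this example.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of all examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_detection_rate(y_true, y_pred):
    """Map each true label to the fraction of its examples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical toy data: 95 majority ("normal") and 5 minority ("rare") examples.
y_true = ["normal"] * 95 + ["rare"] * 5

# A degenerate classifier that always predicts the majority class.
y_pred = ["normal"] * 100

overall = accuracy(y_true, y_pred)
rates = per_class_detection_rate(y_true, y_pred)

print("overall accuracy:", overall)
print("per-class detection rate:", rates)
```

In this toy setting the overall accuracy looks high even though no rare example is detected, which is the effect a per-class detection rate exposes.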
