On the Classification of Imbalanced Datasets

The classification of imbalanced data sets has received considerable attention in recent research. In this paper, we present an overview of the problem of imbalanced data sets, explain the most commonly used techniques, such as sampling and cost-sensitive learning, present some evaluation metrics used on imbalanced data sets, and quote some interesting points drawn from popular and recent research papers on the imbalanced classification problem. This paper does not cover all available research solutions; instead, it tries to give a clear picture of the imbalanced data set classification problem and a brief review of existing solutions. We consider only the binary classification problem on imbalanced data sets.
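As a small illustration of why evaluation metrics need care on imbalanced data, the sketch below scores a classifier that always predicts the majority class. The abstract does not say which metrics the paper covers; recall, specificity, F-measure and G-mean are used here as commonly cited choices, and the function names and the 990-to-10 class split are assumptions made only for this example.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return accuracy plus metrics that are sensitive to the minority class."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # minority-class hit rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # majority-class hit rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "g_mean": g_mean,
    }


# Hypothetical data: 990 majority (0) and 10 minority (1) examples,
# scored against a classifier that always predicts the majority class.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000
metrics = imbalance_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case accuracy is high even though no minority example is detected, while recall, F-measure and G-mean all fall to zero, which is the usual argument for reporting them alongside or instead of accuracy.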
