Knowledge discovery from imbalanced and noisy data

Class imbalance and labeling errors present significant challenges to data mining and knowledge discovery applications. Some previous work has discussed these important topics; however, the relationship between the two issues has not received enough attention. Further, much of the previous work in this domain is fragmented and contradictory, raising serious questions about the reliability and validity of the empirical conclusions. In response to these issues, we present a comprehensive suite of experiments carefully designed to provide conclusive, reliable, and significant results on the problem of learning from noisy and imbalanced data. Noise is shown to significantly impact all of the learners considered in this work, and a particularly important factor is the class in which the noise is located (which, as discussed throughout this work, has very important implications for noise handling). The impact of noise, however, varies dramatically depending on the learning algorithm: simple algorithms such as naive Bayes and nearest neighbor learners are often more robust than more complex learners such as support vector machines or random forests. Sampling techniques, which are often used to alleviate the adverse impacts of imbalanced data, are shown to improve the performance of learners built from noisy and imbalanced data. In particular, simple sampling techniques such as random undersampling are generally the most effective.
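To make the two factors studied here concrete, the sketch below illustrates, under stated assumptions, how class noise concentrated in one class and random undersampling might be set up in an experiment. It uses numpy and scikit-learn; the helpers `inject_class_noise` and `random_undersample`, along with the dataset sizes and noise rate, are hypothetical choices made for this example and are not the paper's actual experimental procedure.

```python
# Minimal sketch (not the paper's implementation): inject label noise into one
# class of an imbalanced dataset, rebalance with random undersampling, and fit
# a simple learner such as naive Bayes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def inject_class_noise(y, source_class, rate, rng):
    """Flip `rate` of the labels belonging to `source_class` to the other
    class (binary 0/1 labels assumed). Models class noise located in one class."""
    y_noisy = y.copy()
    idx = np.flatnonzero(y == source_class)
    flipped = rng.choice(idx, size=int(rate * len(idx)), replace=False)
    y_noisy[flipped] = 1 - source_class
    return y_noisy

def random_undersample(X, y, rng):
    """Randomly discard majority-class examples until both classes are equal in size."""
    classes, counts = np.unique(y, return_counts=True)
    minority, majority = classes[np.argmin(counts)], classes[np.argmax(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = rng.choice(np.flatnonzero(y == majority), size=len(min_idx), replace=False)
    keep = np.concatenate([min_idx, maj_idx])
    return X[keep], y[keep]

# Synthetic imbalanced data: roughly 5% positive (minority) class.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           flip_y=0.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Corrupt 20% of the training labels drawn from the majority class,
# then undersample and train; evaluate on clean test labels.
y_tr_noisy = inject_class_noise(y_tr, source_class=0, rate=0.20, rng=rng)
X_bal, y_bal = random_undersample(X_tr, y_tr_noisy, rng)
model = GaussianNB().fit(X_bal, y_bal)
print("Test AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```

Varying `source_class` and `rate`, and swapping in other learners or sampling schemes, would reproduce the general shape of such an experiment, though the specific data sets, noise levels, and evaluation design of the study itself are not shown here.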
