A Comparison of Oversampling Methods on Imbalanced Topic Classification of Korean News Articles

Machine learning has progressed to match human performance, including the field of text classification. However, when training data are imbalanced, classifiers do not perform well. Oversampling is one way to overcome the problem of imbalanced data and there are many oversampling methods that can be conveniently implemented. While comparative researches of oversampling methods on non-text data have been conducted, studies comparing oversampling methods under a unifying framework on text data are scarce. This study finds that while oversampling methods generally improve the performance of classifiers, similarity is an important factor that influences the performance of classifiers on imbalanced and resampled data.

[1]  Ee-Peng Lim,et al.  On strategies for imbalanced text classification using SVM: A comparative study , 2009, Decis. Support Syst..

[2]  Chi-Hyuck Jun,et al.  Comparison of data pre-processing techniques for relaxing class imbalance problem , 2014 .

[3]  Taeho Jo,et al.  A Multiple Resampling Method for Learning from Imbalanced Data Sets , 2004, Comput. Intell..

[4]  Foster J. Provost,et al.  Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction , 2003, J. Artif. Intell. Res..

[5]  Bartosz Krawczyk,et al.  Learning from imbalanced data: open challenges and future directions , 2016, Progress in Artificial Intelligence.

[6]  Eunjeong Lucy Park,et al.  KoNLPy: Korean natural language processing in Python , 2014 .

[7]  Kazuyuki Murase,et al.  A Novel Synthetic Minority Oversampling Technique for Imbalanced Data Set Learning , 2011, ICONIP.

[8]  Charles X. Ling,et al.  Data Mining for Direct Marketing: Problems and Solutions , 1998, KDD.

[9]  Zhuoyuan Zheng,et al.  Oversampling Method for Imbalanced Classification , 2015, Comput. Informatics.

[10]  Vasile Palade,et al.  Efficient resampling methods for training support vector machines with imbalanced datasets , 2010, The 2010 International Joint Conference on Neural Networks (IJCNN).

[11]  Xin Yao,et al.  MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning , 2014 .

[12]  Amit Singhal,et al.  Pivoted document length normalization , 1996, SIGIR 1996.

[13]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[14]  N. Japkowicz Learning from Imbalanced Data Sets: A Comparison of Various Strategies * , 2000 .

[15]  Nitesh V. Chawla,et al.  Data Mining for Imbalanced Datasets: An Overview , 2005, The Data Mining and Knowledge Discovery Handbook.

[16]  Tom Fawcett,et al.  Robust Classification for Imprecise Environments , 2000, Machine Learning.

[17]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[18]  Nitesh V. Chawla,et al.  Editorial: special issue on learning from imbalanced data sets , 2004, SKDD.

[19]  A. Martins Probability biases as Bayesian inference , 2006, Judgment and Decision Making.

[20]  Stan Matwin,et al.  Addressing the Curse of Imbalanced Training Sets: One-Sided Selection , 1997, ICML.

[21]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[22]  Yiming Yang,et al.  A re-examination of text categorization methods , 1999, SIGIR '99.

[23]  Robert Tibshirani,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition , 2001, Springer Series in Statistics.

[24]  Francisco Herrera,et al.  An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics , 2013, Inf. Sci..

[25]  U. N. Kim,et al.  A Study on the Adjustment of Posterior Probability for Oversampling when the Target is Rare , 2011 .

[26]  Haibo He,et al.  ADASYN: Adaptive synthetic sampling approach for imbalanced learning , 2008, 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence).

[27]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[28]  Fernando Nogueira,et al.  Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning , 2016, J. Mach. Learn. Res..

[29]  Refractor Vision , 2000, The Lancet.

[30]  Karen Spärck Jones A statistical interpretation of term specificity and its application in retrieval , 2021, J. Documentation.

[31]  Charu C. Aggarwal,et al.  A Survey of Text Classification Algorithms , 2012, Mining Text Data.

[32]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[33]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[34]  Hien M. Nguyen,et al.  Borderline over-sampling for imbalanced data classification , 2009, Int. J. Knowl. Eng. Soft Data Paradigms.