Optimizing unbalanced text classification tasks by integrating critical data mining and restricted rewriting techniques

Oversampling technology has been widely used to improve the classification task of unbalanced data. However, unlike structured data, the basic unit of text is words or characters, which can cause oversampling instances in digital space to lose word similarity in semantic space. To solve this problem, use text rewriting to directly generate artificial samples. Unfortunately, existing rewriting techniques usually destroy the grammatical structure and logic of the original text. In this article, we improve and limit some existing text rewriting methods, and propose an effective algorithm to mine feature words in various texts to help complete text rewriting. At the same time, by calculating the similarity between texts, various types of data are divided into key data and non‐key data, and finally different rewriting processes are designed for them. The experimental results of four unbalanced text classification tasks show that our method is superior to the previous text rewriting method, which can improve the classification accuracy of the model by 1.7% to 2.9%, and the AUC can be increased by 0.012 to 0.058. The ablation experiment also explored the effects of various variables and methods on the experimental results.

[1]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[2]  Yi Lu,et al.  Robust neural learning from unbalanced data samples , 1998, 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36227).

[3]  Yoon Kim,et al.  Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[4]  Hui Han,et al.  Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning , 2005, ICIC.

[5]  Abhishek Verma,et al.  Deep CNN-LSTM with combined kernels from multiple branches for IMDb review sentiment analysis , 2017, 2017 IEEE 8th Annual Ubiquitous Computing, Electronics and Mobile Communication Conference (UEMCON).

[6]  Yijing Li,et al.  Imbalanced text sentiment classification using universal and domain-specific knowledge , 2018, Knowl. Based Syst..

[7]  Yijing Li,et al.  Learning from class-imbalanced data: Review of methods and applications , 2017, Expert Syst. Appl..

[8]  Jia Liu,et al.  Character-Level Attention Convolutional Neural Networks for Short-Text Classification , 2019, HCC.

[9]  Taghi M. Khoshgoftaar,et al.  Using Random Undersampling to Alleviate Class Imbalance on Tweet Sentiment Data , 2015, 2015 IEEE International Conference on Information Reuse and Integration.

[10]  Bartosz Krawczyk,et al.  Learning from imbalanced data: open challenges and future directions , 2016, Progress in Artificial Intelligence.

[11]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[12]  Enhong Chen,et al.  Hierarchical Multi-label Text Classification: An Attention-based Recurrent Network Approach , 2019, CIKM.

[13]  Josef Kittler,et al.  A Multiple Expert Approach to the Class Imbalance Problem Using Inverse Random under Sampling , 2009, MCS.

[14]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[15]  Lei Huang,et al.  Text Classification Research with Attention-based Recurrent Neural Networks , 2018, Int. J. Comput. Commun. Control.

[16]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[17]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[18]  ShangJennifer,et al.  Learning from class-imbalanced data , 2017 .

[19]  Bo Pang,et al.  A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts , 2004, ACL.

[20]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[21]  Xuanjing Huang,et al.  Recurrent Neural Network for Text Classification with Multi-Task Learning , 2016, IJCAI.

[22]  Bo Pang,et al.  Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales , 2005, ACL.

[23]  S. M. Hashemy,et al.  Classification of aquifer vulnerability using K-means cluster analysis , 2017 .

[24]  Christopher Potts,et al.  Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank , 2013, EMNLP.

[25]  G. Garrido Cantarero,et al.  [The area under the ROC curve]. , 1996, Medicina clinica.