Text Generation for Imbalanced Text Classification

Imbalanced data is common in real-world datasets. It biases classification models toward the majority class, which is often the negative class, so that most samples are predicted as negative. In this research, text generation techniques were used to generate synthetic minority-class samples and balance the text dataset. Two text generation methods, one based on Markov chains and one based on Long Short-Term Memory (LSTM) networks, were applied and compared in terms of their ability to improve the performance of imbalanced text classification. The experimental study is based on an LSTM network classifier, with traditional over-sampling as a baseline. The study used our dataset of Thai-language advertisement text from Facebook. Judged by the increase in recall, these techniques improved the models' ability to predict positive samples, which form the minority class. The Markov chain technique outperformed both traditional over-sampling and LSTM-based text generation in the majority of the models.
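
To make the Markov chain idea concrete, the following is a minimal sketch of how synthetic minority-class texts might be generated until the two classes are the same size. It is not the authors' implementation: the first-order (single-token) chain, the function names, the `<s>`/`</s>` boundary markers, and the English placeholder texts are all assumptions made for illustration. Thai text is not whitespace-delimited, so the sketch also assumes each sample has already been segmented into space-separated tokens.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers


def build_chain(texts):
    """Map each token to the list of tokens observed right after it."""
    chain = defaultdict(list)
    for text in texts:
        tokens = [START] + text.split() + [END]
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return chain


def generate(chain, rng, max_tokens=30):
    """Walk the chain from START until END or the length limit is reached."""
    token, output = START, []
    while len(output) < max_tokens:
        # Duplicates in the list make frequent transitions more likely.
        token = rng.choice(chain[token])
        if token == END:
            break
        output.append(token)
    return " ".join(output)


def balance_with_markov(minority, majority_size, seed=0):
    """Return the minority texts plus synthetic ones, up to majority_size."""
    if not any(text.split() for text in minority):
        raise ValueError("need at least one non-empty minority text")
    rng = random.Random(seed)
    chain = build_chain(minority)
    synthetic = []
    while len(minority) + len(synthetic) < majority_size:
        sample = generate(chain, rng)
        if sample:  # skip empty generations
            synthetic.append(sample)
    return minority + synthetic


if __name__ == "__main__":
    # Placeholder minority-class texts, not taken from the paper's dataset.
    ads = ["buy cheap phone now", "buy new phone today", "cheap phone sale now"]
    for text in balance_with_markov(ads, majority_size=6, seed=1):
        print(text)
```

Unlike traditional over-sampling, which duplicates existing minority samples, a chain of this kind can recombine observed word transitions into sequences that do not appear verbatim in the training data. A higher-order chain (conditioning on two or more preceding tokens) would produce more fluent but less varied text.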
