A Simple and Efficient Ensemble Classifier Combining Multiple Neural Network Models on Social Media Datasets in Vietnamese

Text classification is a popular topic in natural language processing that currently attracts considerable research effort worldwide. The rapid growth of social media data calls for more attention from researchers to analyze it. Many studies address this task in various languages, but few address Vietnamese. This study therefore aims to classify Vietnamese social media texts from three Vietnamese benchmark datasets. We use and optimize advanced deep learning models, including CNN, LSTM, and their variants. We also implement BERT, which has not previously been applied to these datasets. Our experiments identify a suitable model for the classification task on each dataset. To take advantage of the single models, we propose an ensemble model that combines the highest-performing models. Our single models achieve good results on each dataset, and our ensemble model achieves the best performance on all three. We reach an F1-score of 86.96% on the HSD-VLSP dataset, 65.79% on the UIT-VSMEC dataset, and 92.79% and 89.70% for sentiments and topics on the UIT-VSFC dataset, respectively. Our models thus outperform previous studies on these datasets.
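
To illustrate the idea of combining several single models, the following is a minimal sketch of a hard-voting ensemble. The abstract does not state the combination rule, so majority voting is an assumption here, as are the function name `majority_vote` and the tie-breaking rule (prefer the model listed first, which is assumed to be the strongest). The inputs stand in for the label predictions of models such as CNN, LSTM, and BERT.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model label predictions by majority vote.

    predictions[m][i] is the label that model m assigns to sample i.
    Assumption: on a tie, the label from the earliest-listed model
    among the tied labels wins.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")

    combined = []
    for i in range(n_samples):
        # Votes are kept in model order so ties can be broken by position.
        votes = [p[i] for p in predictions]
        counts = Counter(votes)
        top = max(counts.values())
        winner = next(label for label in votes if counts[label] == top)
        combined.append(winner)
    return combined


if __name__ == "__main__":
    # Hypothetical predictions from three single models on four samples.
    cnn_preds = [0, 1, 2, 1]
    lstm_preds = [0, 2, 2, 0]
    bert_preds = [1, 2, 2, 2]
    print(majority_vote([cnn_preds, lstm_preds, bert_preds]))
```

Hard voting needs only the predicted labels, so it works across models with different architectures. If each model also exposes class probabilities, averaging those probabilities (soft voting) is a common alternative.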
