Analysis Comparison of FastText and Word2vec for Detecting Offensive Language

Twitter is one of the most popular platforms for sharing opinions, ideas, feelings, and information. Tweets may contain language that is considered offensive toward a group or an individual. One problem caused by offensive language is cyberbullying, which can encourage people to post questions online and use strong language to express hate. As a result, many users who interact online, including on social media, risk being mocked or harassed with abusive language that can affect them mentally. Identifying offensive language is therefore a necessary and useful task, especially on social media platforms. Offensive language can be classified as irony, sarcasm, or figurative. Currently, much research on offensive language detection focuses on only one of irony or sarcasm. However, offensive language detection may call for multi-class classification, for example with a figurative class that covers both irony and sarcasm. Here, we propose categorizing tweets into four classes: irony, sarcasm, figurative, or not offensive at all (regular). Specifically, we first capture the relationships between words using Word2vec and FastText word embeddings, each with the Continuous Bag of Words (CBOW) and Skip-gram architectures. We then classify the offensive language label using CNN-BiLSTM, a combination of the deep learning approaches Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM), after first examining the impact of hyper-parameters on language classification. Experiments on the Kaggle dataset indicate that CNN-BiLSTM with Word2vec using the CBOW architecture outperforms CNN-BiLSTM with FastText.
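To make the embedding step more concrete, the following is a minimal sketch, not the authors' implementation, of how the two training objectives differ and of what sets FastText apart from Word2vec. CBOW builds (context words, target word) examples, Skip-gram builds (center word, context word) examples, and FastText additionally represents each word through character n-grams. The function names, the window size, the n-gram range, and the sample tweet are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def cbow_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """CBOW: each example is (surrounding context words, center word to predict)."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:  # skip single-token inputs that have no context
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Skip-gram: each example is (center word, one context word to predict)."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """FastText-style subwords: character n-grams of the word wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for k in range(len(wrapped) - n + 1):
            grams.append(wrapped[k:k + n])
    return grams


# Hypothetical example tweet, already lowercased and tokenized.
tweet = "oh great another monday".split()
print(cbow_pairs(tweet, window=1))
print(skipgram_pairs(tweet, window=1))
print(char_ngrams("great", n_min=3, n_max=3))
```

In this sketch, the subword units are what allow a FastText-style model to build a vector for a misspelled or unseen word, while a Word2vec-style model only has vectors for whole words seen during training. The resulting embedding vectors would then be passed to the CNN-BiLSTM classifier described above.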
