CyberTronics at SemEval-2020 Task 12: Multilingual Offensive Language Identification over Social Media

The SemEval-2020 Task 12 (OffensEval) challenge focuses on detecting offensive language in posts and comments on social media. The task was organized for several languages, including Arabic, Danish, English, Greek and Turkish. For English, it featured three related sub-tasks: sub-task A was to discriminate between offensive and non-offensive posts, sub-task B was to identify the type of offensive content in a post, and sub-task C was to identify the target of an offensive post. The corpus for each language was built from posts and comments on Twitter, a popular social media platform. We participated in this challenge and submitted results for several languages. This paper presents several machine learning and deep learning techniques for offensiveness prediction and analyzes their performance across various classifiers and feature engineering schemes. Experimental analysis on the training set shows that an SVM using language-specific pre-trained word embeddings (fastText) outperforms the other methods. Our system achieves a macro-averaged F1 score of 0.45 for Arabic, 0.43 for Greek and 0.54 for Turkish.
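For readers unfamiliar with the reported metric, the following is a minimal sketch of how a macro-averaged F1 score can be computed: the F1 score is calculated separately for each class and the per-class scores are averaged without weighting, so the minority class counts as much as the majority class. The function name `macro_f1`, the label strings `OFF` and `NOT`, and the toy predictions are illustrative assumptions, not the system's code or data.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (assumed labels): OFF = offensive, NOT = not offensive
y_true = ["OFF", "NOT", "NOT", "NOT", "OFF", "NOT"]
y_pred = ["OFF", "NOT", "NOT", "OFF", "NOT", "NOT"]
score = macro_f1(y_true, y_pred)
print(f"macro-F1 = {score:.3f}")
```

In this toy case the per-class F1 scores are 0.75 for `NOT` and 0.5 for `OFF`, giving a macro average of 0.625.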
