GSI-UPM at SemEval-2019 Task 5: Semantic Similarity and Word Embeddings for Multilingual Detection of Hate Speech Against Immigrants and Women on Twitter

This paper describes the GSI-UPM system for SemEval-2019 Task 5, which addresses multilingual detection of hate speech against immigrants and women on Twitter. The main contribution is a method based on word embeddings and semantic similarity, combined with traditional features such as n-grams, TF-IDF, and part-of-speech (POS) tags. The feature combination is tuned through ablation tests, which show how useful each feature type is. Our approach outperforms the baseline classifiers on different sub-tasks, and the best of our submitted runs ranked 5th on sub-task A for Spanish.
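The feature combination described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy embeddings, the seed lexicon, the smoothed IDF formula, and all function names (`ngrams`, `tfidf_matrix`, `similarity_features`, `combined_features`) are assumptions made for illustration only. It concatenates TF-IDF weights over word unigrams and bigrams with one semantic-similarity feature per seed word, computed as the maximum cosine similarity between the seed word and any word of the tweet.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional word vectors; a real system would load
# pretrained embeddings instead of this toy table.
EMBEDDINGS = {
    "hate": [0.9, 0.1, 0.0],
    "despise": [0.8, 0.2, 0.1],
    "love": [-0.7, 0.6, 0.1],
    "people": [0.0, 0.1, 0.9],
    "them": [0.1, 0.0, 0.8],
}

# Hypothetical seed lexicon; each entry yields one similarity feature.
LEXICON = ["hate"]


def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    out = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tfidf_matrix(docs):
    """Build a dense TF-IDF matrix over unigrams and bigrams.

    Uses raw term frequency and a smoothed IDF, log((1 + N) / (1 + df)) + 1,
    which is one common variant and an assumption of this sketch.
    """
    grams = [ngrams(d.lower().split()) for d in docs]
    vocab = sorted({g for doc in grams for g in doc})
    n_docs = len(docs)
    df = Counter(g for doc in grams for g in set(doc))
    rows = []
    for doc in grams:
        tf = Counter(doc)
        rows.append(
            [tf[g] * (math.log((1 + n_docs) / (1 + df[g])) + 1) for g in vocab]
        )
    return vocab, rows


def similarity_features(doc):
    """For each seed word, the maximum cosine similarity to any known token."""
    tokens = doc.lower().split()
    feats = []
    for seed in LEXICON:
        sims = [
            cosine(EMBEDDINGS[t], EMBEDDINGS[seed])
            for t in tokens
            if t in EMBEDDINGS
        ]
        feats.append(max(sims) if sims else 0.0)
    return feats


def combined_features(docs):
    """Concatenate TF-IDF n-gram features with semantic-similarity features."""
    vocab, rows = tfidf_matrix(docs)
    return vocab, [row + similarity_features(d) for row, d in zip(rows, docs)]


if __name__ == "__main__":
    vocab, X = combined_features(["I despise them", "I love people"])
    for row in X:
        print([round(value, 3) for value in row])
```

In this sketch an ablation test would amount to dropping one feature group (for example, the similarity columns) before training a classifier and comparing the scores; the classifier itself is left out here.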
