Hate Speech and Offensive Language Detection: A New Feature Set with Filter-Embedded Combining Feature Selection

Social media has changed the world and plays an important role in people's lives. Platforms such as Twitter, Facebook and YouTube have created a new dimension of communication by providing channels to express and exchange ideas freely. Although this evolution brings numerous benefits, the dynamic environment and the allowance of anonymous posts can expose the uglier side of humanity: irresponsible users abuse freedom of speech by aggressively expressing opinions or ideas that incite hatred. This study addresses the detection of hate speech and offensive language. The main difficulty of the task is that there is no clear boundary between the two. We propose a new, selected feature set for detecting hate speech and offensive language. Using a Twitter dataset, the experiments combine word n-grams with enhanced syntactic n-grams, and filter-embedded combining feature selection is applied to reduce the feature set. The experimental results indicate that this combination, with feature selection, performs well when classifying tweets into three classes: hate speech, offensive language, or neither. The approach reaches 91% for accuracy and for the averaged precision, recall and F1.
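As an illustration of the word n-gram and filter stages mentioned above, the following is a minimal sketch rather than the study's implementation. It assumes whitespace tokenization, unigrams plus bigrams, and a chi-square statistic as the filter criterion; the abstract does not specify any of these, and the function names (`word_ngrams`, `chi_square_scores`, `select_top_k`) are hypothetical. The enhanced syntactic n-grams and the embedded selection stage are not shown.

```python
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max (whitespace tokens, lowercased)."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def chi_square_scores(docs, labels, n_max=2):
    """Score each n-gram by its largest one-vs-rest chi-square value over the classes.

    Assumption: chi-square on binary presence/absence is used as the filter;
    the source does not name the filter criterion.
    """
    total = len(docs)
    class_counts = Counter(labels)
    doc_freq = Counter()           # n-gram -> number of documents containing it
    doc_freq_by_class = Counter()  # (n-gram, label) -> number of documents
    for text, label in zip(docs, labels):
        for gram in set(word_ngrams(text, n_max)):
            doc_freq[gram] += 1
            doc_freq_by_class[(gram, label)] += 1

    scores = {}
    for gram, n_gram in doc_freq.items():
        best = 0.0
        for label, n_label in class_counts.items():
            a = doc_freq_by_class[(gram, label)]  # in class, has n-gram
            b = n_gram - a                        # other classes, has n-gram
            c = n_label - a                       # in class, lacks n-gram
            d = total - n_gram - c                # other classes, lacks n-gram
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom:
                best = max(best, total * (a * d - b * c) ** 2 / denom)
        scores[gram] = best
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring n-grams (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy, hand-made examples; not taken from the Twitter dataset used in the study.
    docs = ["you are awful", "you are kind", "awful awful people", "kind people"]
    labels = ["offensive", "neither", "offensive", "neither"]
    print(select_top_k(chi_square_scores(docs, labels), k=2))
```

In a filter-embedded scheme, the n-grams retained by a filter like this would then be passed to a model that performs its own selection during training (for example, one with sparsity-inducing regularization); which embedded method the study uses is not stated in the abstract.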
