A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on Twitter

Pre-processing plays an essential role in disambiguating the meaning of short texts, not only in applications that classify short texts but also in clustering and anomaly detection. Although pre-processing can have a considerable impact on overall system performance, it has received less attention in the literature than feature extraction and classification. This paper analyzes twelve pre-processing techniques on three pre-classified Twitter hate speech datasets and observes their impact on the classification tasks they support. It also proposes a systematic approach for applying these techniques so that features are retained without information loss. Two word-level feature extraction models are used, and the performance of the proposed package is compared with state-of-the-art methods. To validate the performance gains, both traditional and deep learning classifiers are used. The experimental results suggest that some pre-processing techniques degrade performance; these are identified, along with the best-performing combination of techniques.
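As an illustration of what applying pre-processing techniques in a controlled order can look like, the following is a minimal Python sketch. It is not the package proposed in the paper: the abstract does not name the twelve techniques, so the steps below (URL and mention placeholders, hashtag stripping, elongation reduction, lowercasing, whitespace normalization), the placeholder tokens `<url>` and `<user>`, and all function names are assumptions chosen for illustration only.

```python
import re
from typing import Callable, Iterable

# A pre-processing step maps one string to another.
Step = Callable[[str], str]


def replace_urls(text: str) -> str:
    """Replace URLs with a placeholder so the 'a link was present' signal is kept."""
    return re.sub(r"https?://\S+|www\.\S+", "<url>", text)


def replace_mentions(text: str) -> str:
    """Replace @username mentions with a placeholder token."""
    return re.sub(r"@\w+", "<user>", text)


def strip_hashtag_symbol(text: str) -> str:
    """Drop the '#' but keep the hashtag word, which may carry meaning."""
    return re.sub(r"#(\w+)", r"\1", text)


def reduce_elongation(text: str) -> str:
    """Collapse runs of three or more identical characters to two ('sooooo' -> 'soo')."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


def lowercase(text: str) -> str:
    return text.lower()


def normalize_whitespace(text: str) -> str:
    return " ".join(text.split())


def build_pipeline(steps: Iterable[Step]) -> Step:
    """Compose steps into one function; steps run in the order given."""
    ordered = list(steps)

    def run(text: str) -> str:
        for step in ordered:
            text = step(text)
        return text

    return run


# Hypothetical ordering: placeholders are inserted before lowercasing,
# and whitespace is normalized last.
pipeline = build_pipeline([
    replace_urls,
    replace_mentions,
    strip_hashtag_symbol,
    reduce_elongation,
    lowercase,
    normalize_whitespace,
])

example = "@Bob this is sooooo BAD!!! see https://t.co/xyz #NotOkay"
print(pipeline(example))
```

Because each step is an independent function, a single technique can be removed from the list or reordered and the downstream classifier re-evaluated, which is one simple way to measure whether that technique helps or hurts performance.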
