Paper information - A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on Twitter

Pre-processing plays an essential role in disambiguating the meaning of short texts, not only in applications that classify short texts but also in clustering and anomaly detection. Pre-processing can have a considerable impact on overall system performance; however, it is less explored in the literature than feature extraction and classification. This paper analyzes twelve pre-processing techniques on three pre-classified Twitter hate speech datasets and observes their impact on the classification tasks they support. It also proposes a systematic approach to text pre-processing that applies different techniques so as to retain features without information loss. Two word-level feature extraction models are used, and the performance of the proposed package is compared with state-of-the-art methods. To validate gains in performance, both traditional and deep learning classifiers are used. The experimental results suggest that some pre-processing techniques degrade performance; these are identified, along with the best-performing combination of pre-processing techniques.
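The abstract does not list the twelve techniques or the order in which the proposed approach applies them, so the following Python sketch is only an illustration of what a composable tweet pre-processing pipeline could look like. The step functions, the `<url>` and `<user>` placeholder tokens, and the default ordering are all assumptions made for this example, not the paper's package.

```python
import re

# Hypothetical patterns for this illustration; not taken from the paper.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
REPEAT_RE = re.compile(r"(\w)\1{2,}")
SPACE_RE = re.compile(r"\s+")


def replace_urls(text):
    """Replace each URL with a placeholder so the 'has a link' signal is kept."""
    return URL_RE.sub("<url>", text)


def replace_mentions(text):
    """Replace each @user mention with a placeholder token."""
    return MENTION_RE.sub("<user>", text)


def unpack_hashtags(text):
    """Drop the '#' but keep the hashtag word as a feature."""
    return HASHTAG_RE.sub(r"\1", text)


def squeeze_repeats(text):
    """Collapse runs of three or more identical word characters to two."""
    return REPEAT_RE.sub(r"\1\1", text)


def lowercase(text):
    return text.lower()


def normalize_whitespace(text):
    return SPACE_RE.sub(" ", text).strip()


# Assumed default order; each step can be removed to measure its effect.
DEFAULT_STEPS = (
    replace_urls,
    replace_mentions,
    unpack_hashtags,
    squeeze_repeats,
    lowercase,
    normalize_whitespace,
)


def preprocess(text, steps=DEFAULT_STEPS):
    """Apply the given pre-processing steps to one tweet, in order."""
    for step in steps:
        text = step(text)
    return text


if __name__ == "__main__":
    print(preprocess("@Bob this is sooooo   BAD http://t.co/xyz #NotOk"))
```

Because each technique is a separate function, a study of this kind could pass different `steps` tuples to `preprocess` and compare classifier performance for each combination.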

[1] Avi Arampatzis, et al. A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis, 2018, Expert Syst. Appl.

[2] Cody Buntain, et al. A Large Labeled Corpus for Online Harassment Research, 2017, WebSci.

[3] Rashid Mehmood, et al. Automatic Detection and Validation of Smart City Events Using HPC and Apache Spark Platforms, 2019, Smart Infrastructure and Applications.

[4] Owen Rambow, et al. Sentiment Analysis of Twitter Data, 2011.

[5] Ingmar Weber, et al. Automated Hate Speech Detection and the Problem of Offensive Language, 2017, ICWSM.

[6] Norisma Idris, et al. Toward Tweets Normalization Using Maximum Entropy, 2015, NUT@IJCNLP.

[7] Boi Faltings, et al. A :) Is Worth a Thousand Words: How People Attach Sentiment to Emoticons and Words in Tweets, 2013, 2013 International Conference on Social Computing.

[8] Guandong Xu, et al. What’s Happening Around the World? A Survey and Framework on Event Detection Techniques on Twitter, 2019, Journal of Grid Computing.

[9] Gregory Piatetsky-Shapiro, et al. Summary from the KDD-03 panel: data mining: the next 10 years, 2003, SKDD.

[10] Zhao Jianqiang, et al. Pre-processing Boosting Twitter Sentiment Analysis?, 2015, 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity).

[11] Rashid Mehmood, et al. Enabling Next Generation Logistics and Planning for Smarter Societies, 2017, ANT/SEIT.

[12] Katarzyna Musial, et al. Towards Improved Deep Contextual Embedding for the identification of Irony and Sarcasm, 2020, 2020 International Joint Conference on Neural Networks (IJCNN).

[13] Usman Qamar, et al. TOM: Twitter opinion mining framework using hybrid classification scheme, 2014, Decis. Support Syst.

[14] Serkan Günal, et al. The impact of preprocessing on text classification, 2014, Inf. Process. Manag.

[15] Cícero Nogueira dos Santos, et al. Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts, 2014, COLING.

[16] Usman Naseem, et al. Hybrid Words Representation for Airlines Sentiment Analysis, 2019, Australasian Conference on Artificial Intelligence.

[17] Peter Norvig, et al. Deep Learning with Dynamic Computation Graphs, 2017, ICLR.

[18] Saif Mohammad, et al. Sentiment Analysis of Short Informal Texts, 2014, J. Artif. Intell. Res.

[19] Yong Shi, et al. The Role of Text Pre-processing in Sentiment Analysis, 2013, ITQM.

[20] Guandong Xu, et al. Enhanced Heartbeat Graph for emerging event detection on Twitter using time series networks, 2019, Expert Syst. Appl.

[21] Yoon Kim, et al. Convolutional Neural Networks for Sentence Classification, 2014, EMNLP.

[22] Gerard Salton, et al. Term-Weighting Approaches in Automatic Text Retrieval, 1988, Inf. Process. Manag.