Toward a new approach to author profiling based on the extraction of statistical features

Author profiling on social media and other online platforms, which are characterized by huge volumes of data, has recently become a critical issue. It is of increasing interest in various fields, including forensics, security, marketing, and education. The main objective of author profiling is to identify the type of author behind a message, in particular whether it is a human or a bot. Bots have a very strong presence on these platforms: they are designed to draw users' attention to specific events and are often used to disseminate incorrect or false information. In this work, we propose a new approach to detect these bots and to identify the gender of anonymous authors on social networks. Our approach, which is purely statistical, is based on numerical features (APSF) extracted from users’ tweets and on the random forest technique. A total of 17 stylometry-based features were used to train the model. To assess the performance of our approach, we considered standard measures, namely accuracy, precision, recall, and F1-score. The results show that our approach achieves high performance on both the English and Spanish datasets. For the English dataset, we achieved an accuracy of 92.45% for the bot detection task and 90.36% for gender classification; for the Spanish dataset, the corresponding accuracies were 89.68% and 88.88%.
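The pipeline summarized above (statistical features per user, then a classifier scored with accuracy, precision, recall, and F1-score) can be illustrated with a minimal Python sketch. The abstract does not list the 17 features, so the five computed here (`avg_chars`, `urls_per_tweet`, `mentions_per_tweet`, `hashtags_per_tweet`, `type_token_ratio`) are hypothetical stand-ins, and the function names `extract_features` and `binary_metrics` are ours, not the authors'.

```python
import re
from statistics import mean

# Simple patterns for tweet elements (illustrative, not the paper's definitions).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def extract_features(tweets):
    """Map one user's tweets to a few hypothetical statistical features."""
    if not tweets:
        raise ValueError("need at least one tweet")
    # Word tokens across all tweets, with URLs removed first.
    words = [
        w.lower()
        for t in tweets
        for w in re.findall(r"\w+", URL_RE.sub("", t))
    ]
    return {
        "avg_chars": mean(len(t) for t in tweets),
        "urls_per_tweet": mean(len(URL_RE.findall(t)) for t in tweets),
        "mentions_per_tweet": mean(len(MENTION_RE.findall(t)) for t in tweets),
        "hashtags_per_tweet": mean(len(HASHTAG_RE.findall(t)) for t in tweets),
        # Vocabulary richness: distinct words divided by total words.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
    }


def binary_metrics(y_true, y_pred, positive="bot"):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("need at least one prediction")
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

In a full pipeline, the per-user feature vectors would be passed to a random forest classifier (for example scikit-learn's `RandomForestClassifier`, which is an assumption here since the abstract does not name an implementation), and `binary_metrics` would be applied to its predictions on held-out users.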
