Identification Of Bot Accounts In Twitter Using 2D CNNs On User-generated Contents

The number of accounts that autonomously publish content on the web is growing fast, and it is very common to encounter them, especially on social networks. They are mostly used to post ads, false information, and scams. Such an account is called a bot, an abbreviation of robot (a.k.a. social bot or sybil account). To help end users decide whether a social network post comes from a bot or a real user, it is essential to identify these accounts automatically and accurately, and to notify the end user in time. In this work, we present a model that classifies social network accounts as humans or bots starting from a set of one hundred textual contents that the account has published, specifically on the Twitter platform. When an account is identified as belonging to a real user, we perform an additional classification step to determine the user's gender. The model combines convolutional and dense neural networks and operates on textual data represented by word embedding vectors. Our architecture was trained and evaluated on the data made available by the PAN Bots and Gender Profiling challenge at CLEF 2019, which provided annotated data in both English and Spanish. Using accuracy as the evaluation metric, we obtained a score of 0.9182 for the Bot vs. Human classification and 0.7973 for Male vs. Female on the English data. Results on the Spanish data were similar: 0.9156 for Bot vs. Human and 0.7417 for Male vs. Female. We consider these results encouraging, and we propose our model as a good starting point for future research on the topic when no other descriptive details about the account are available.
To support future development and the replicability of results, the source code of the proposed model is available in the following GitHub repository: https://github.com/marcopoli/Identification-ofTwitter-bots-using-CNN
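The two-step decision described above (bot vs. human first, gender only for accounts judged human) and the accuracy metric can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `profile_account` and `accuracy`, the label strings, and the two toy classifiers are all hypothetical stand-ins, and the toy rules merely take the place of the trained convolutional and dense networks.

```python
from typing import Callable, Sequence

Tweets = Sequence[str]


def profile_account(
    tweets: Tweets,
    is_bot: Callable[[Tweets], bool],
    predict_gender: Callable[[Tweets], str],
) -> str:
    """Two-step cascade: return 'bot', otherwise the predicted gender label."""
    if is_bot(tweets):
        return "bot"
    # The gender classifier is only consulted for accounts judged human.
    return predict_gender(tweets)


def accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of predictions that match the expected labels."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Hypothetical toy classifiers, used only so the sketch runs end to end.
# In the paper these roles are played by trained neural networks.
def toy_is_bot(tweets: Tweets) -> bool:
    # Assumption for illustration: an account repeating one text is a bot.
    return len(set(tweets)) == 1


def toy_predict_gender(tweets: Tweets) -> str:
    # Assumption for illustration: a constant prediction.
    return "female"


repetitive_account = ["Buy now!"] * 100
varied_account = [f"thought number {i}" for i in range(100)]

labels = [
    profile_account(repetitive_account, toy_is_bot, toy_predict_gender),
    profile_account(varied_account, toy_is_bot, toy_predict_gender),
]
score = accuracy(labels, ["bot", "male"])
```

With these toy rules, the repetitive account is labeled as a bot and the varied one is passed on to the gender step, so any error in the first step propagates to the second.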
