Bots and Gender Profiling using Masking Techniques: Notebook for PAN at CLEF 2019

This work describes our proposed solution for the author profiling shared task at PAN 2019. The task consists of identifying whether the author of a Twitter feed is a bot or a human and, if human, determining whether the author is male or female. As in previous years, the task covers more than one language, in this case English and Spanish. Our proposal focuses on the preprocessing and feature extraction steps: we mainly apply masking techniques that emphasize the relevant terms by obfuscating the irrelevant ones while keeping information about the structure of the texts. Using this approach, we obtained accuracies of 0.92 and 0.81 on the Spanish test set for classifying bots/humans and males/females, respectively; similarly, we obtained accuracies of 0.91 and 0.82 on the English test set.
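The abstract does not spell out the masking scheme, so the following is only a minimal sketch of one common variant, under stated assumptions: the "relevant" terms are taken to be the k most frequent words in the training texts, and every other word is replaced by asterisks of the same length so that punctuation, spacing, and word lengths survive. The function names `build_vocabulary` and `mask_text`, the frequency-based relevance criterion, and the `*` mask symbol are all illustrative choices, not the authors' confirmed method.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+", re.UNICODE)


def build_vocabulary(texts, k):
    """Return the k most frequent lowercase words (assumed 'relevant' terms)."""
    counts = Counter(
        word for text in texts for word in WORD_RE.findall(text.lower())
    )
    return {word for word, _ in counts.most_common(k)}


def mask_text(text, keep):
    """Replace each word not in `keep` with asterisks of equal length.

    Punctuation and whitespace are left untouched, so the structure of
    the text is preserved while the masked words are obfuscated.
    """
    def mask_word(match):
        word = match.group(0)
        return word if word.lower() in keep else "*" * len(word)

    return WORD_RE.sub(mask_word, text)


if __name__ == "__main__":
    training_texts = ["the cat sat on the mat", "the dog sat"]
    keep = build_vocabulary(training_texts, k=2)  # {"the", "sat"}
    print(mask_text("The cat sat, twice!", keep))  # The *** sat, *****!
```

The masked texts could then be passed to any standard feature extractor, such as character n-grams, in place of the raw tweets.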
