Bot and Gender Identification in Twitter using Word and Character N-Grams

Automated social media accounts, called bots, have gained considerable importance worldwide in recent years. Social bots can have serious implications for society by swaying political elections or spreading disinformation, which has made social bot detection an emerging research area. Tools and techniques to automatically detect and classify manipulative bots are therefore needed. In this notebook, we describe our system for the author profiling task at PAN 2019 on bot and gender identification on Twitter. The submitted system uses word unigrams and bigrams as well as character n-grams as features. After tweet preprocessing and feature construction, we trained a linear Support Vector Machine (SVM) classifier. Our results show that bots can be distinguished from humans with fairly high accuracy, and that the same SVM architecture can reliably determine the gender of the author (male or female). The submitted model achieved an accuracy of 0.92 for bot detection on the English dataset and 0.91 on the Spanish dataset. For gender identification, it achieved accuracies of 0.82 and 0.78 on the English and Spanish corpora, respectively. Our simple model ranked 8th out of 55 competitors.
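To illustrate the kind of features described above, the following is a minimal sketch of extracting word unigrams, word bigrams, and character n-grams from a single tweet. It is not the submitted system: the function names, the lowercasing and whitespace tokenization, the feature-name prefixes, and the character n-gram range of 3 to 5 are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return all contiguous character n-grams, including spaces."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def extract_features(tweet, char_n_range=(3, 5)):
    """Count word uni/bigrams and character n-grams for one tweet.

    Assumptions (not taken from the paper): lowercasing, whitespace
    tokenization, and a character n-gram range of 3 to 5.
    """
    text = tweet.lower()
    tokens = text.split()
    features = Counter()

    # Word unigrams (n=1) and bigrams (n=2), prefixed to avoid collisions.
    for n in (1, 2):
        for gram in word_ngrams(tokens, n):
            features[f"w{n}:{gram}"] += 1

    # Character n-grams over the raw lowercased text.
    lo, hi = char_n_range
    for n in range(lo, hi + 1):
        for gram in char_ngrams(text, n):
            features[f"c{n}:{gram}"] += 1

    return features


features = extract_features("Bots tweet a lot")
```

Count dictionaries of this kind would typically be converted to sparse vectors, often with TF-IDF weighting, before being passed to a linear SVM; whether the submitted system used such weighting is not stated in the abstract.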
