Supervised Classification of Twitter Accounts Based on Textual Content of Tweets

Our system, submitted to the bots and gender profiling task of PAN 2019, uses a two-step binary classification approach. First, a random forest classifier labels each account as bot or human based on a combination of term occurrences and aggregated statistics. Accounts classified as human are then labeled male or female by a logistic regression classifier that takes data-driven function words as input. We obtain highly competitive bot and gender classification accuracies on English (0.96 and 0.84, respectively) and lower accuracies on Spanish (0.88 and 0.73, respectively).
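The control flow of the two-step approach can be illustrated with a minimal sketch. It shows only the cascade, not the submitted system: `classify_account`, `toy_is_bot`, and `placeholder_gender` are hypothetical names, and the two stand-in functions are simple placeholders for the trained random forest and logistic regression classifiers, whose features and parameters are not given here.

```python
from typing import Callable, List

Tweets = List[str]


def classify_account(
    tweets: Tweets,
    is_bot: Callable[[Tweets], bool],
    predict_gender: Callable[[Tweets], str],
) -> str:
    """Two-step cascade: the gender step runs only for accounts judged human."""
    if is_bot(tweets):
        return "bot"
    return predict_gender(tweets)


def toy_is_bot(tweets: Tweets) -> bool:
    # Placeholder for step one (assumption, not the paper's features):
    # flag accounts where at most half of the tweets are distinct.
    return len(set(tweets)) <= len(tweets) / 2


def placeholder_gender(tweets: Tweets) -> str:
    # Placeholder for step two: a real system would apply a trained
    # classifier here; this stub returns a fixed label.
    return "female"


if __name__ == "__main__":
    repetitive = ["buy now", "buy now", "buy now", "buy now"]
    varied = ["hello", "nice day", "see you"]
    print(classify_account(repetitive, toy_is_bot, placeholder_gender))
    print(classify_account(varied, toy_is_bot, placeholder_gender))
```

Passing the two classifiers in as callables keeps the cascade independent of any particular model, so trained models could be substituted for the placeholders without changing `classify_account`.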
