Custom Document Embeddings Via the Centroids Method: Gender Classification in an Author Profiling Task: Notebook for PAN at CLEF 2018

According to Smart Insights, of the world's 7.5 billion people, 4 billion are Internet users, and of those a remarkable 3.19 billion are active social media users. According to a report by the U.S. Internet Crime Complaint Center, identity theft, extortion, and harassment or threats of violence were among the most frequently reported cyber-crime events in 2016 alone. The Author Profiling (AP) task may help counteract these phenomena by profiling cyber-criminals. AP consists of detecting personal traits of authors (e.g. gender, age, personality) from their texts. In this report we describe a method to address the AP problem, one of the three shared tasks evaluated as an exercise in digital text forensics at PAN 2018, held within the CLEF conference (Conference and Labs of the Evaluation Forum). Our approach blends Word Embeddings (WE) and the Centroids Method to produce Document Embeddings (DE) that deliver competitive results in predicting the gender of authors on a dataset of text posts from Twitter. Specifically, on the test dataset our proposal achieves an accuracy of 0.78 for English-language users, and an average accuracy of 0.77 across English, Spanish, and Arabic users.
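The idea of building a document embedding as a centroid of word embeddings can be illustrated with a minimal sketch. It assumes a plain unweighted mean over the vectors of known tokens; the abstract does not state whether the authors weight the words, so this is not necessarily their exact formulation. The `TOY_EMBEDDINGS` table and the `centroid_embedding` helper are hypothetical names introduced for illustration only; in practice the vectors would come from a pre-trained word embedding model.

```python
import numpy as np

# Hypothetical 3-dimensional toy embeddings (assumption for illustration;
# real vectors would be loaded from a pre-trained word embedding model).
TOY_EMBEDDINGS = {
    "good": np.array([1.0, 0.0, 0.0]),
    "morning": np.array([0.0, 1.0, 0.0]),
    "everyone": np.array([0.0, 0.0, 1.0]),
}


def centroid_embedding(tokens, embeddings, dim):
    """Return the unweighted mean of the vectors of all known tokens.

    Tokens missing from the vocabulary are skipped; if no token is known,
    a zero vector of length `dim` is returned.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# "lol" is out of vocabulary here, so it does not contribute to the centroid.
doc = "good morning everyone lol".split()
doc_vector = centroid_embedding(doc, TOY_EMBEDDINGS, dim=3)
```

A fixed-length vector such as `doc_vector` could then serve as the input features of any standard classifier for gender prediction.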
