Multi-Language Neural Network Model with Advance Preprocessor for Gender Classification over Social Media: Notebook for PAN at CLEF 2018

This paper describes our approach to the Author Profiling Shared Task at PAN 2018. The goal was to classify the gender of a Twitter user solely from their tweets. The paper explores a simple and efficient multi-language model for gender classification. The approach consists of three stages: tweet preprocessing, text representation, and classification model construction. The model achieved its best result on English, with an accuracy of 72.79%; for Spanish and Arabic the accuracy was 72.20% and 64.36%, respectively.
