Gender and Language-Variety Identification with MicroTC

In this notebook, we describe our approach to the Author Profiling task at PAN17, which consists of both gender and language variety identification for Twitter users. We used our MicroTC (μTC) framework as the primary tool to create our classifiers. μTC follows a simple approach to text classification: it turns the classification problem into a model selection problem over several simple text transformations, a combination of tokenizers, and a term-weighting scheme, and then classifies using a Support Vector Machine. Our approach reaches accuracies of 0.7838, 0.8054, 0.7957, and 0.8538 for gender identification, and 0.8275, 0.9004, 0.9554, and 0.9850 for language variety identification, for Arabic, English, Spanish, and Portuguese, respectively.
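The idea of treating text classification as model selection can be illustrated with a minimal sketch. It is not the μTC implementation: the configuration space `SPACE`, the function names, the toy data, and the use of plain random sampling as the search strategy are all assumptions made for illustration, and a nearest-prototype classifier stands in for the Support Vector Machine so that the example needs only the standard library.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical configuration space (illustrative names and values only).
SPACE = {
    "lowercase": [True, False],          # a simple text transformation
    "word_ngrams": [(1,), (1, 2)],       # word n-gram tokenizers to combine
    "char_ngrams": [(), (3,), (2, 4)],   # character n-gram tokenizers to combine
}

def tokenize(text, cfg):
    """Apply the configured transformation and union of tokenizers."""
    if cfg["lowercase"]:
        text = text.lower()
    words = text.split()
    tokens = []
    for n in cfg["word_ngrams"]:
        tokens += ["w:" + " ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    for n in cfg["char_ngrams"]:
        tokens += ["c:" + text[i:i + n] for i in range(len(text) - n + 1)]
    return tokens

def fit_idf(token_lists):
    """Smoothed inverse document frequency learned from the training set."""
    df = Counter(tok for toks in token_lists for tok in set(toks))
    n = len(token_lists)
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def vectorize(tokens, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen tokens are dropped."""
    vec = {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def evaluate(cfg, train, valid):
    """Validation accuracy of one configuration.

    Stand-in classifier: each class prototype is the sum of its training
    vectors (unnormalised, so it assumes roughly balanced classes).
    """
    train_tokens = [tokenize(text, cfg) for text, _ in train]
    idf = fit_idf(train_tokens)
    prototypes = defaultdict(Counter)
    for toks, (_, label) in zip(train_tokens, train):
        prototypes[label].update(vectorize(toks, idf))
    hits = 0
    for text, label in valid:
        vec = vectorize(tokenize(text, cfg), idf)
        pred = max(prototypes,
                   key=lambda c: sum(w * prototypes[c].get(t, 0.0) for t, w in vec.items()))
        hits += pred == label
    return hits / len(valid)

def random_search(train, valid, n_samples=6, seed=0):
    """Sample configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, -1.0
    for _ in range(n_samples):
        cfg = {name: rng.choice(options) for name, options in SPACE.items()}
        score = evaluate(cfg, train, valid)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Invented toy data for a two-variety problem; not taken from the PAN17 corpus.
TRAIN = [
    ("my favourite colour is grey", "gb"),
    ("the neighbour organised a programme", "gb"),
    ("my favorite color is gray", "us"),
    ("the neighbor organized a program", "us"),
]
VALID = [("what colour is the harbour", "gb"), ("what color is the harbor", "us")]

best_cfg, best_score = random_search(TRAIN, VALID)
print(best_cfg, best_score)
```

The point of the sketch is the division of labour: the preprocessing, tokenizer, and weighting choices become hyperparameters, and the search loop, rather than manual feature engineering, decides which combination is used for a given language and task.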
