Fast compressed-based strategies for author profiling of social media texts

Given a text, it may be useful to determine the age, gender, native language, nationality, personality and other demographic attributes of its author. This task is called author profiling. It has been studied in several areas, especially linguistics and natural language processing, typically by extracting content- and style-based features from training documents and then applying various machine learning approaches. In this paper we address the author profiling task using several compression-inspired strategies. More specifically, we generate different models that identify the age and gender of the author of a given document without analysing or extracting specific features from the textual content, which makes them style-oblivious approaches. We compare and analyse their behaviour over datasets of different nature. Our results show that simple compression-inspired techniques obtain very competitive accuracy while being orders of magnitude faster in the evaluation phase than other state-of-the-art techniques, which are complex and resource-demanding.
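As a rough illustration of the general idea behind compression-based classification, the sketch below assigns a document to the class whose training text lets a compressor encode the document most cheaply. This is not the models evaluated in the paper: it assumes Python's standard `zlib` compressor as a stand-in, and the names `compressed_size` and `classify`, as well as the toy training data, are hypothetical.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document: str, training: dict[str, str]) -> str:
    """Return the label whose training text makes `document` cheapest to encode.

    For each label, the cost is the number of extra compressed bytes needed
    when the document is appended to that label's training text. No features
    are extracted from the content; only compressed sizes are compared.
    """
    def extra_bytes(label: str) -> int:
        corpus = training[label]
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(training, key=extra_bytes)


# Toy, made-up training texts (assumption: one concatenated string per class).
TRAINING = {
    "teen": "omg lol that party was so cool, see u at school tmrw! "
            "homework is the worst lol. ",
    "adult": "the quarterly meeting with the mortgage advisor was rescheduled; "
             "my kids have a dentist appointment. ",
}

if __name__ == "__main__":
    print(classify("homework is the worst lol, see u at school tmrw", TRAINING))
```

Because classification here only requires compressing a few strings per class, the evaluation step is cheap, which is consistent with the speed argument made above. A general-purpose compressor such as `zlib` has a limited match window, so this sketch would only be reasonable for small training texts.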
