Competitive Author Profiling Using Compression-Based Strategies

Author profiling consists of determining demographic attributes of a document's author, such as gender, age, nationality, language, or religion. This task, which has applications in fields such as forensics, security, and marketing, has been approached from different areas, especially linguistics and natural language processing, by extracting different types of features from training documents, usually content- and style-based features. In this paper we address the problem with several compression-inspired strategies that generate different models without analyzing or extracting specific features from the textual content, which makes them style-oblivious approaches. We analyze the behavior of these techniques, combine them, and compare them with other state-of-the-art methods. We show that they can be competitive in terms of accuracy, giving the best predictions for some domains, and that they are efficient in running time.
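To make the general idea of compression-based classification concrete, the following is a minimal sketch, not the strategies evaluated in this paper. It assumes a generic off-the-shelf compressor (Python's `zlib`) and a simple decision rule: assign a document to the class whose training text makes the document cheapest to encode. The function names `compressed_size` and `classify`, and the representation of each class as one concatenated training string, are illustrative assumptions.

```python
import zlib
from typing import Mapping


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs to encode the text at maximum compression."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document: str, class_corpora: Mapping[str, str]) -> str:
    """Return the label whose training text best 'explains' the document.

    Assumption: each class is represented by one concatenated training
    string. The score for a class is the number of extra bytes needed to
    compress the document when it is appended to that class's text.
    No content or style features are extracted at any point.
    """
    def extra_bytes(corpus: str) -> int:
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(class_corpora, key=lambda label: extra_bytes(class_corpora[label]))
```

For a profiling task, the labels would be the values of the attribute of interest (for example, age groups), and each training string would concatenate the documents written by authors with that value. Note that zlib only looks back over a 32 KB window, so larger training sets would need a different compressor or a different way of building the class models.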
