INAOE's Participation at PAN'15: Author Profiling task

In this paper, we describe the participation of the Language Technologies Lab of INAOE at PAN 2015. Building on the discriminative and descriptive information used in the Author Profiling (AP) literature, we take this information to a higher level by exploiting a combination of discriminative and descriptive representations. To do so, we apply dimensionality reduction techniques on top of typical discriminative and descriptive textual features for the AP task. The main idea is that each representation, computed from the full feature space, automatically highlights different stylistic and thematic properties of the documents. Specifically, we propose the joint use of Second Order Attributes (SOA) and Latent Semantic Analysis (LSA) to highlight discriminative and descriptive properties, respectively. To evaluate our approach, we compare it against standard Bag-of-Words (BOW), SOA, and LSA representations on the PAN 2015 corpus for AP. Experimental results show that the combination of SOA and LSA outperforms BOW and each individual representation, which gives evidence of its usefulness for predicting gender, age, and personality profiles. More importantly, according to the PAN 2015 evaluation, the proposed approach ranks in the top 3 positions in every dataset.
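The joint use of SOA and LSA described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the function names, the term weighting, the single normalization step in the SOA part, and the number of latent dimensions `k` are all assumptions made for illustration, and NumPy's SVD stands in for whatever LSA tooling was actually used.

```python
import numpy as np

def bow_matrix(docs, vocab):
    """Term-frequency matrix: one row per document, one column per term."""
    index = {term: j for j, term in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for tok in doc.lower().split():
            if tok in index:
                X[i, index[tok]] += 1.0
    return X

def fit_soa(X, labels, n_profiles):
    """Term-by-profile matrix (assumed, simplified SOA weighting)."""
    doc_len = np.maximum(X.sum(axis=1, keepdims=True), 1.0)
    rel = np.log2(1.0 + X / doc_len)          # dampened relative frequency
    T = np.zeros((X.shape[1], n_profiles))
    for p in range(n_profiles):
        T[:, p] = rel[labels == p].sum(axis=0)
    # Normalize each term's weights across profiles (guarding empty rows).
    return T / np.maximum(T.sum(axis=1, keepdims=True), 1e-12)

def transform_soa(X, T):
    """Describe each document by its affinity to every profile."""
    doc_len = np.maximum(X.sum(axis=1, keepdims=True), 1.0)
    return (X / doc_len) @ T

def fit_lsa(X, k):
    """Top-k right singular vectors of the document-term matrix."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T

def transform_lsa(X, V_k):
    """Project documents onto the k latent dimensions."""
    return X @ V_k

# Hypothetical toy corpus: profile 0 and profile 1 (e.g. two gender classes).
docs = ["football beer match", "beer football game",
        "shopping dress love", "love dress shoes"]
labels = np.array([0, 0, 1, 1])
vocab = sorted({tok for d in docs for tok in d.split()})

X = bow_matrix(docs, vocab)
soa = transform_soa(X, fit_soa(X, labels, n_profiles=2))   # discriminative view
lsa = transform_lsa(X, fit_lsa(X, k=2))                    # descriptive view
combined = np.hstack([soa, lsa])                           # joint representation
print(combined.shape)
```

In this sketch the SOA block contributes one column per profile and the LSA block contributes `k` columns, so the combined vector is far smaller than the BOW vector it is derived from. The concatenated rows could then be passed to any standard classifier.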
