Efficient Social Network Multilingual Classification using Character, POS n-grams and Dynamic Normalization

In this paper we describe a dynamic normalization process applied to multilingual social network documents (Facebook and Twitter) to improve performance on the author profiling task for short texts. After normalization, character n-grams and POS-tag n-grams are extracted to capture as much as possible of the stylistic information encoded in the documents (emoticons, character flooding, capital letters, references to other users, hyperlinks, hashtags, etc.). In experiments with an SVM classifier, the approach reached scores of up to 90%.
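The normalization and n-gram steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the placeholder tags (`<URL>`, `<USR>`, `<HSH>`), the cap of three repeated characters, and the function names are all assumptions made for illustration, and the POS tags are assumed to come from an external tagger.

```python
import re
from collections import Counter

# Hypothetical placeholder tags; the paper's actual label set is not given here.
TAGS = {"url": "<URL>", "usr": "<USR>", "hsh": "<HSH>"}

# One pass over the text so that an inserted tag is never re-matched.
TOKEN_RE = re.compile(
    r"(?P<url>https?://\S+)"      # hyperlinks
    r"|(?P<usr>(?<!\w)@\w+)"      # references to other users
    r"|(?P<hsh>(?<!\w)#\w+)"      # hashtags
)

def normalize(text, max_repeat=3):
    """Replace links, mentions and hashtags with tags and cap character flooding.

    Runs of one character longer than max_repeat are shortened to max_repeat
    (an assumed threshold), so flooding stays visible but adds fewer rare n-grams.
    """
    text = TOKEN_RE.sub(lambda m: TAGS[m.lastgroup], text)
    flood = re.compile(r"(.)\1{%d,}" % max_repeat)
    return flood.sub(lambda m: m.group(1) * max_repeat, text)

def char_ngrams(text, n=3):
    """Count overlapping character n-grams, keeping case, spaces and punctuation."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def pos_ngrams(tags, n=2):
    """Count n-grams over a sequence of POS tags produced by an external tagger."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

if __name__ == "__main__":
    doc = normalize("Sooooo coool!!!!! @bob see http://t.co/x #fun")
    print(doc)
    print(char_ngrams(doc, 3).most_common(5))
```

The resulting counts could then be turned into feature vectors for a classifier such as an SVM. Case and punctuation are deliberately left untouched, since capital letters and emoticons are among the stylistic cues the abstract lists.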
