Multilingual author profiling on Facebook

Proposed a multilingual (Roman Urdu and English) author profiling corpus of Facebook profiles. Manually developed a bilingual dictionary (Roman Urdu to English) of 7749 entries and translated the multilingual corpus using it. Applied 64 stylometric and 11 content-based features to the multilingual and translated corpora. Best results were obtained using word bigrams for age identification, and word unigrams, character 3-grams and character 8-grams for gender identification.

Author profiling is the identification of an author's demographic features by examining the author's written text. Recently, it has attracted the attention of the research community due to its potential applications in forensics, security, marketing, identification of fake profiles on online social networking sites, identification of the senders of harassing messages, etc. Benchmark corpora are needed to develop and evaluate author profiling techniques. The majority of existing corpora are for English and other European languages, not for under-resourced South Asian languages such as Roman Urdu (Urdu written using the English alphabet). Roman Urdu is used in daily communication by a large number of native Urdu speakers around the world, particularly in Facebook posts and comments, tweets, blogs, chat blogs and SMS messages. Constructing Urdu sentences with the English alphabet changes the linguistic properties of the text. We aim to investigate the behavior of existing author profiling techniques on multilingual text consisting of English and Roman Urdu, specifically for gender and age identification.
We focus here on author profiling on Facebook by (i) developing a multilingual (Roman Urdu and English) corpus, (ii) manually building a bilingual dictionary for translating Roman Urdu words into English, (iii) modeling existing state-of-the-art author profiling techniques using content-based features (word and character n-grams) and 64 stylistic features (11 lexical word-based features, 47 lexical character-based features and 6 vocabulary richness measures) for age and gender identification on the multilingual and translated corpora, and (iv) evaluating and comparing the behavior of these techniques on the multilingual and translated corpora. Our extensive empirical evaluation shows that (i) existing author profiling techniques can be used for multilingual text (Roman Urdu + English) as well as for monolingual text (the corpus obtained by translating the multilingual corpus with the bilingual dictionary), (ii) content-based methods outperform stylistic methods for both the gender and age identification tasks, and (iii) translating the multilingual corpus into monolingual text does not improve results.
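
The dictionary-based translation and the content-based features described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the four-entry dictionary, the function names and the whitespace tokenization are hypothetical choices made for illustration, and the type-token ratio is only one common vocabulary richness measure, assumed here because the abstract does not list the six measures that were used.

```python
from collections import Counter

# Hypothetical toy dictionary for illustration only; the dictionary
# described in the abstract has 7749 manually built entries.
ROMAN_URDU_TO_ENGLISH = {
    "mein": "I",
    "bohat": "very",
    "khush": "happy",
    "hoon": "am",
}


def translate(tokens, dictionary):
    """Word-by-word lookup: replace tokens found in the dictionary and
    leave all other tokens (e.g. English words) unchanged."""
    return [dictionary.get(token.lower(), token) for token in tokens]


def word_ngrams(tokens, n):
    """Count contiguous word n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_ngrams(text, n):
    """Count contiguous character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens (assumed richness measure)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


if __name__ == "__main__":
    post = "mein bohat khush hoon today"
    tokens = post.split()  # simplistic whitespace tokenization (assumption)

    translated = translate(tokens, ROMAN_URDU_TO_ENGLISH)
    print(translated)
    print(word_ngrams(tokens, 2).most_common(3))
    print(char_ngrams(post, 3).most_common(3))
    print(type_token_ratio(tokens))
```

Counts like these would be computed once on the multilingual text and once on its translated version, then passed to a classifier, so the two corpora can be compared with the same feature set.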
