Detecting deception in Online Social Networks

Over the past decade, Online Social Networks (OSNs) have helped hundreds of millions of people develop reliable computer-mediated relationships. However, many user profiles in OSNs contain misleading, inconsistent, or false information. Existing studies have shown that lying in OSNs is widespread, often as a way to protect a user's privacy. For OSNs to continue expanding their role as a communication medium in our society, the information posted on them must be trustworthy. Here we define a set of analysis methods for detecting deceptive information about user gender on Twitter. In addition, we report empirical results on a stratified data set of 174,600 Twitter profiles, split 50-50 between male and female users. Our automated approach compares gender indicators obtained from different profile characteristics, including first name, user name, and layout colors. Through extensive experimentation with this data set, we establish the overall accuracy of each indicator and the strength of every possible value of each indicator. We define male-trending and female-trending users based on two factors: the overall accuracy of each characteristic and the relative strength of that characteristic's value for a given user. We apply a Bayesian classifier to the weighted average of characteristics for each user. We flag a profile for possible deception when our classification as male or female contradicts the self-declared gender, which we obtain independently of the Twitter profile. Finally, we manually inspect a subset of the profiles identified as potentially deceptive to verify the correctness of our predictions.
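
The flagging idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Indicator` class, the signed strength scale (-1 for strongly male, +1 for strongly female), the accuracy values, and the threshold are all hypothetical, and a simple threshold rule stands in for the Bayesian classifier mentioned above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Indicator:
    """One gender indicator for one profile (all values here are hypothetical)."""
    name: str
    accuracy: float  # assumed overall accuracy of this characteristic, in [0, 1]
    strength: float  # assumed signed strength of this user's value: -1 (male) .. +1 (female)


def trend_score(indicators: List[Indicator]) -> float:
    """Accuracy-weighted average of the signed indicator strengths."""
    total_weight = sum(i.accuracy for i in indicators)
    if total_weight == 0:
        return 0.0
    return sum(i.accuracy * i.strength for i in indicators) / total_weight


def classify(score: float, threshold: float = 0.2) -> str:
    """Threshold rule used here as a stand-in for a trained classifier."""
    if score >= threshold:
        return "female"
    if score <= -threshold:
        return "male"
    return "undetermined"


def flag_possible_deception(indicators: List[Indicator],
                            declared_gender: str,
                            threshold: float = 0.2) -> bool:
    """Flag a profile when the predicted trend contradicts the declared gender."""
    predicted = classify(trend_score(indicators), threshold)
    return predicted != "undetermined" and predicted != declared_gender


# Illustrative profile: first name and user name lean male, colors lean slightly female.
profile = [
    Indicator("first_name", accuracy=0.8, strength=-0.9),
    Indicator("user_name", accuracy=0.7, strength=-0.4),
    Indicator("layout_colors", accuracy=0.6, strength=0.1),
]
flagged = flag_possible_deception(profile, declared_gender="female")
```

In this sketch a profile whose indicators disagree only weakly falls into the "undetermined" band and is never flagged, which reflects the abstract's emphasis on weighting each characteristic by both its overall accuracy and the strength of its value.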
