Who are the spoilers in social media marketing? Incremental learning of latent semantics for social spam detection

With the rise of the social web, there has been growing concern about the quality of user-generated content on social media sites (SMSs). Deceptive comments harm users’ trust in online social media and cause financial loss to firms. Previous studies have used various features and classification algorithms to detect and filter social spam on several social media platforms. However, to the best of our knowledge, no previous study has exploited both probabilistic topic modeling and incremental learning to detect social spam on SMSs. The main contribution of this paper is therefore the design of a novel detection methodology that combines topic- and user-based features to improve the effectiveness of social spam detection. The proposed methodology uses a probabilistic generative model, namely labeled latent Dirichlet allocation (L-LDA), to mine the latent semantics of user-generated comments, and an incremental learning approach to handle the changing feature space. An experiment based on a large dataset extracted from YouTube demonstrates the effectiveness of the proposed methodology, which achieves an average accuracy of 91.17% in social spam detection. Our statistical analysis reveals that topic-based features significantly improve social spam detection, a finding with important implications for business practice.
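The idea of learning incrementally over a feature space that grows can be illustrated with a minimal sketch. It is not the paper's method: a plain online perceptron stands in for the incremental learner, the topic proportions are hand-written rather than inferred by L-LDA, and every name in it (`OnlinePerceptron`, `combine`, `links_per_comment`, the topic labels) is hypothetical. Weights are stored by feature name, so a topic that first appears late in the stream simply starts with a weight of zero instead of forcing a retrain.

```python
class OnlinePerceptron:
    """Mistake-driven linear classifier with a sparse, growable weight map.

    Stand-in for an incremental learner; not the algorithm used in the paper.
    """

    def __init__(self, learning_rate=1.0):
        self.learning_rate = learning_rate
        self.weights = {}  # feature name -> weight; unseen features count as 0.0
        self.bias = 0.0

    def score(self, features):
        return self.bias + sum(
            self.weights.get(name, 0.0) * value for name, value in features.items()
        )

    def predict(self, features):
        # +1 means spam, -1 means legitimate.
        return 1 if self.score(features) > 0 else -1

    def partial_fit(self, features, label):
        """Update on a single example; returns True if the weights changed."""
        if self.predict(features) == label:
            return False
        for name, value in features.items():
            self.weights[name] = (
                self.weights.get(name, 0.0) + self.learning_rate * label * value
            )
        self.bias += self.learning_rate * label
        return True


def combine(topic_features, user_features):
    """Merge topic- and user-based features into one namespaced feature map."""
    merged = {f"topic:{name}": value for name, value in topic_features.items()}
    merged.update({f"user:{name}": value for name, value in user_features.items()})
    return merged


# Hypothetical comment stream: (features, label). Later comments introduce
# topics ("giveaway", "lyrics") that were absent when learning started.
stream = [
    (combine({"promo": 0.9, "music": 0.1}, {"links_per_comment": 1.0}), 1),
    (combine({"promo": 0.1, "music": 0.9}, {"links_per_comment": 0.0}), -1),
    (combine({"promo": 0.8, "giveaway": 0.2}, {"links_per_comment": 1.0}), 1),
    (combine({"music": 0.7, "lyrics": 0.3}, {"links_per_comment": 0.0}), -1),
]

model = OnlinePerceptron()
for _ in range(5):  # a few passes over the toy stream
    for features, label in stream:
        model.partial_fit(features, label)

predictions = [model.predict(features) for features, _ in stream]
```

Keying weights by name rather than by vector position is one simple way to tolerate a changing feature space; a real system would also need to decide how to age out topics that stop appearing.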
