Detecting blog spam hashtags using topic modeling

Tremendous amounts of data are generated daily. Unstructured text data distributed through news, blogs, and social media has attracted particular attention from researchers because it contains abundant information about consumers' opinions. However, as text data becomes more useful, attempts to profit by distorting it, whether maliciously or non-maliciously, are also increasing. Accordingly, various spam detection techniques have been studied to prevent the side effects of spamming. The most representative studies address e-mail spam detection, web spam detection, and opinion spam detection. "Spam" is recognized on the basis of three characteristics and actions: (1) if a user is recognized as a spammer, then all content created by that user is recognized as spam; (2) if content is exposed to other users regardless of those users' intention, then that content is recognized as spam; and (3) any content that contains false information, whether malicious or non-malicious, is recognized as spam. Many studies have addressed type (1) and type (2) spamming by analyzing metadata such as user networks and spam words. Type (3), however, has received relatively little study because it is difficult to determine the veracity of a given word or piece of information. In this study, we regard a hashtag that is irrelevant to the content of a blog post as spam and devise a methodology to detect such spam hashtags.
