Duplicate Detection for Identifying Social Spam in Microblogs

As an important kind of social media, microblog has become an important source of opinion mining and collective behavior study. However, social spams may affect the analytical results greatly. This paper focuses on the problem of identifying potential social spammers who copy pieces of information from others. An improved locality-sensitive hashing based method is used for detecting duplicated tweets. Intensive empirical study over a real-life microblog dataset crawled from Sina Weibo, one of the most popular microblogging services, is conducted. The characteristics of potential spammers and their behaviors are analyzed.

[1]  Ciro Cattuto,et al.  Social spam detection , 2009, AIRWeb '09.

[2]  Evaggelia Pitoura,et al.  One is enough: distributed filtering for duplicate elimination , 2011, CIKM '11.

[3]  Zhe Wang,et al.  Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search , 2007, VLDB.

[4]  Hosung Park,et al.  What is Twitter, a social network or a news media? , 2010, WWW '10.

[5]  Aoying Zhou,et al.  Towards modeling popularity of microblogs , 2013, Frontiers of Computer Science.

[6]  Aoying Zhou,et al.  Social media data analysis for revealing collective behaviors , 2012, KDD.

[7]  Stéphane Marchand-Maillet,et al.  Proceeding of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010, Geneva, Switzerland, July 19-23, 2010 , 2010, SIGIR.

[8]  Andreas Paepcke,et al.  SpotSigs: robust and efficient near duplicate detection in large web collections , 2008, SIGIR '08.

[9]  Wen-tau Yih,et al.  Adaptive near-duplicate detection via similarity learning , 2010, SIGIR.

[10]  Johan Bollen,et al.  Twitter mood predicts the stock market , 2010, J. Comput. Sci..

[11]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[12]  Karen Rose,et al.  What is Twitter , 2009 .

[13]  Jure Leskovec,et al.  Inferring networks of diffusion and influence , 2010, KDD.

[14]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[15]  Monika Henzinger,et al.  Finding near-duplicate web pages: a large-scale evaluation of algorithms , 2006, SIGIR.

[16]  Roberto J. Bayardo,et al.  Scaling up all pairs similarity search , 2007, WWW '07.

[17]  Ophir Frieder,et al.  Collection statistics for fast duplicate document detection , 2002, TOIS.

[18]  Proceedings of the 20th ACM Conference on Information and Knowledge Management, CIKM 2011, Glasgow, United Kingdom, October 24-28, 2011 , 2011, CIKM.

[19]  Hector Garcia-Molina,et al.  Combating Web Spam with TrustRank , 2004, VLDB.

[20]  Peter F. Patel-Schneider,et al.  Proceedings of the 16th international conference on World Wide Web , 2007, WWW 2007.

[21]  Harris Drucker,et al.  Support vector machines for spam categorization , 1999, IEEE Trans. Neural Networks.

[22]  Bernardo A. Huberman,et al.  What Trends in Chinese Social Media , 2011, ArXiv.

[23]  Rina Panigrahy,et al.  Entropy based nearest neighbor search in high dimensions , 2005, SODA '06.

[24]  Zhe Wang,et al.  Ferret: a toolkit for content-based similarity search of feature-rich data , 2006, EuroSys.

[25]  Hisashi Koga,et al.  Fast agglomerative hierarchical clustering algorithm using Locality-Sensitive Hashing , 2007, Knowledge and Information Systems.

[26]  Jun Hu,et al.  Detecting and characterizing social spam campaigns , 2010, CCS '10.

[27]  Efstathios Stamatatos Plagiarism detection based on structural information , 2011, CIKM '11.

[28]  Din J. Wasem,et al.  Mining of Massive Datasets , 2014 .

[29]  Kyumin Lee,et al.  Uncovering social spammers: social honeypots + machine learning , 2010, SIGIR.

[30]  Yutaka Matsuo,et al.  Earthquake shakes Twitter users: real-time event detection by social sensors , 2010, WWW '10.

[31]  Isabell M. Welpe,et al.  Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment , 2010, ICWSM.

[32]  Rizal Setya Perdana What is Twitter , 2013 .

[33]  Patrick Paroubek,et al.  Twitter as a Corpus for Sentiment Analysis and Opinion Mining , 2010, LREC.

[34]  Xuanjing Huang,et al.  Learning hash codes for efficient content reuse detection , 2012, SIGIR '12.

[35]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[36]  Gurmeet Singh Manku,et al.  Detecting near-duplicates for web crawling , 2007, WWW '07.

[37]  Moses Charikar,et al.  Similarity estimation techniques from rounding algorithms , 2002, STOC '02.

[38]  Dmitri Loguinov,et al.  Probabilistic near-duplicate detection using simhash , 2011, CIKM '11.

[39]  Timothy W. Finin,et al.  Why we twitter: understanding microblogging usage and communities , 2007, WebKDD/SNA-KDD '07.

[40]  Srinivasan Parthasarathy,et al.  Bayesian Locality Sensitive Hashing for Fast Similarity Search , 2011, Proc. VLDB Endow..

[41]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[42]  Timothy W. Finin,et al.  Why We Twitter: An Analysis of a Microblogging Community , 2009, WebKDD/SNA-KDD.

[43]  Andrei Z. Broder,et al.  On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).