Fingerprint-Based Near-Duplicate Document Detection with Applications to SNS Spam Detection

Social networking is used by millions of people around the world and has become the most popular way for people to connect and interact online with their friends. Many social networking sites, such as Facebook, MySpace, and Twitter, have a huge number of active users, which also makes them attractive to spammers and cheaters who want to steal users' personal information or advertise their products. Many methods have recently been proposed to detect spam comments on social networks using a variety of techniques. In this paper, we propose a similarity-based method that combines a fingerprinting technique with a trie data structure and a meet-in-the-middle approach in order to achieve higher accuracy in spam comment detection. Using the proposed approach, we are able to detect around 98% of the spam comments in our dataset.
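To make the idea of fingerprint-based similarity concrete, the following is a minimal sketch of one common fingerprinting scheme: a SimHash-style bit vote over word tokens, compared by Hamming distance. It is an illustration only and not the method proposed in the paper. The tokenizer, the use of MD5 as the token hash, the 64-bit width, the `max_distance` threshold, and all function names are assumptions chosen for the example, and the trie and meet-in-the-middle components are not shown.

```python
import hashlib
import re


def tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fingerprint(text, bits=64):
    """Compute a SimHash-style fingerprint: each token votes +1/-1 on every bit position."""
    counts = [0] * bits
    for tok in tokens(text):
        # Assumption: the first 8 bytes of MD5 serve as a 64-bit token hash.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i, c in enumerate(counts):
        if c > 0:  # a bit is set when the positive votes win
            fp |= 1 << i
    return fp


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, max_distance=3):
    """Treat two comments as near-duplicates if their fingerprints are close.

    max_distance is an illustrative threshold, not a value from the paper.
    """
    return hamming(fingerprint(a), fingerprint(b)) <= max_distance
```

Under these assumptions, two comments that differ only in case or punctuation produce identical fingerprints, while comments with small wording changes tend to differ in only a few bits. A practical system would also need an index over the fingerprints so that each new comment is not compared against every stored one.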
