WARCProcessor: An Integrative Tool for Building and Management of Web Spam Corpora

In this work we present the design and implementation of WARCProcessor, a novel multiplatform integrative tool for building scientific datasets that facilitate experimentation in web spam research. The application lets the user specify multiple criteria that control how new corpora are generated, while reducing the number of repetitive and error-prone tasks involved in maintaining existing corpora. To this end, WARCProcessor supports up to six data sources commonly used in web spam research and can store the output corpus in the standard WARC format together with complementary metadata files. Additionally, the application supports the automatic and concurrent download of web sites from the Internet, and allows the user to configure the depth to which links are followed as well as the behaviour when redirected URLs are encountered. WARCProcessor provides both an interactive GUI and a command-line utility that can run in the background.
