Detecting Malicious Web Scraping Activity: A Study with Diverse Detectors

We present results on the use of diverse monitoring tools to detect malicious web scraping activity. We analysed a real dataset of Apache HTTP access logs for an e-commerce application provided by a large multinational IT provider for the global travel and tourism industry. Two tools were used to detect scraping activity based on the HTTP requests: a commercial tool and an in-house tool called Arcane. We show the benefits, in terms of overall sensitivity and specificity, that can be achieved by using both systems together, and we discuss potential sources of diversity between the two tools' alert patterns.
