Feature evaluation for web crawler detection with data mining techniques

Highlights

- We examine the effect of applying 7 data mining classifiers to web server logs.
- We automate the detection of well-behaved and malicious crawlers.
- We automate the detection of human and unknown visitors.
- We introduce 2 novel features to facilitate classification of web site visitors.
- Results show improvement in the recall and precision of detection.

Distributed Denial of Service (DDoS) is one of the most damaging attacks on Internet security today. Recently, malicious web crawlers have been used to execute automated DDoS attacks on web sites across the WWW. In this study we examine the effect of applying seven well-established data mining classification algorithms to static web server access logs in order to: (1) classify user sessions as belonging to either automated web crawlers or human visitors and (2) identify which of the automated web crawler sessions exhibit 'malicious' behavior and are potentially participants in a DDoS attack. Classification performance is evaluated in terms of accuracy, recall, precision and F1 score. Seven of the nine vector (i.e. web-session) features employed in our work are borrowed from earlier studies on classifying user sessions as belonging to web crawlers. We also introduce two novel web-session features: the consecutive sequential request ratio and the standard deviation of page request depth. The effectiveness of the new features is evaluated in terms of the information gain and gain ratio metrics. The experimental results demonstrate the potential of the new features to improve the accuracy of data mining classifiers in identifying malicious and well-behaved web crawler sessions.
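To make the feature and the evaluation metrics more concrete, the following Python sketch computes the standard deviation of page request depth for one session, and the information gain and gain ratio of a discretized feature with respect to the session labels. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define page depth, so depth is assumed here to be the number of non-empty segments in the URL path, and the function names (`page_depth`, `depth_std`, `info_gain_and_ratio`) are hypothetical. Information gain and gain ratio follow their standard entropy-based definitions.

```python
import math
from collections import Counter
from statistics import pstdev
from urllib.parse import urlparse


def page_depth(url: str) -> int:
    """Depth of a requested page.

    Assumption: depth = number of non-empty path segments,
    so "/a/b/c.html" has depth 3 and "/" has depth 0.
    """
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


def depth_std(session_urls: list[str]) -> float:
    """Population standard deviation of page depth over one session."""
    depths = [page_depth(url) for url in session_urls]
    return pstdev(depths) if depths else 0.0


def entropy(items) -> float:
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(items)
    counts = Counter(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def info_gain_and_ratio(values, labels):
    """Information gain and gain ratio of a discrete feature.

    values: one discretized feature value per session
    labels: the class label of each session (e.g. "human", "crawler")
    """
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy after splitting the sessions on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    # Split information is the entropy of the feature values themselves.
    split_info = entropy(values)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio
```

As a usage example on made-up data, `depth_std(["/", "/a/b"])` summarizes a session that requested the site root and one page two levels down, and `info_gain_and_ratio(["lo", "lo", "hi", "hi"], ["human", "human", "crawler", "crawler"])` scores a feature that separates the two classes perfectly. A continuous feature such as the depth standard deviation would first need to be discretized, for example by thresholding, before it can be scored this way.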
